Recently, I've had some folks ask me offline what exactly a “complete” Linux cluster stack would look like. That's a good question, and this posting is intended to address it.
So let's start with – what kind of cluster? For the purposes of this posting, I'm primarily talking about a full-function high-availability enterprise-style cluster – not a load-balancing cluster, and not a high-performance scientific (Beowulf-style) cluster.
A
few caveats before proceeding – much of what I'll reference below
will be relative to the Linux-HA[1]
framework,
but the concepts are easily translated to any other clustering
framework one might have in mind.
It's
also worth noting that not every application, nor every configuration
needs every component. Adding unnecessary components adds
complexity, and complexity is the enemy of reliability.
Many
of the components (cluster filesystem, DLM) are primarily needed by
cluster-aware applications. Note that at this time (early 2008) very
few applications are cluster-aware.
The Full Cluster Stack Exposed!
Below is a picture of the full cluster stack –
which I'll describe in more detail later. For the most part, the
components higher up in the picture build on the components lower
down in the picture. To simplify the drawing, I didn't add all the who-uses-whom lines that one might want for a detailed study of this subject.
Cluster Comm - Intracluster communications
The most
basic component any cluster needs is intracluster communications.
There are a variety of different possibilities, but guaranteed packet
delivery is a requirement. Linux-HA has its own custom comm layer
for doing this. It's not perfect, but it works. At one time in the
past, we provided support for the AIS cluster APIs, and if you use
OpenAIS today, then you can still have a reasonable cluster using
Linux-HA and providing compatible support for the AIS protocols. As will become clear, it's not a perfect configuration, but it's a reasonable one. (Of course, like everything, it can always be improved.)
Very
large clusters (hundreds to tens of thousands of nodes) will likely
require a different communication protocol, since most guaranteed
delivery multicast protocols don't scale that high.
Nevertheless,
in an ideal world, all cluster components and cluster-aware
applications would sit on top of the same set of communications
protocols.
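To make the “guaranteed delivery” requirement a little more concrete, here is a minimal sketch of the usual sequence-number / acknowledge / retransmit pattern such a layer is built on. It's written in Python purely for illustration – it is not the Linux-HA comm layer, and the transport object is a stand-in.

import time

class ReliableSender:
    # Toy ack-based guaranteed delivery (illustrative, not the real comm layer).
    def __init__(self, transport, retransmit_interval=0.5):
        self.transport = transport            # stand-in: anything with send(seq, payload)
        self.retransmit_interval = retransmit_interval
        self.next_seq = 0
        self.unacked = {}                     # seq -> (payload, last_send_time)

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, time.time())
        self.transport.send(seq, payload)
        return seq

    def handle_ack(self, seq):
        # The peer confirmed receipt; stop retransmitting this message.
        self.unacked.pop(seq, None)

    def retransmit_due(self):
        # Called periodically: resend anything not yet acknowledged.
        now = time.time()
        for seq, (payload, sent_at) in list(self.unacked.items()):
            if now - sent_at >= self.retransmit_interval:
                self.transport.send(seq, payload)
                self.unacked[seq] = (payload, now)

A real implementation also has to deal with ordering, duplicate suppression on the receiving side, and membership changes – which is where most of the complexity lives.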
Membership – who's in the cluster?
Looking
to the right of the cluster comm box on our architecture chart,
you'll see the membership box. The next basic function that a
cluster has to provide is membership services. Membership is closely related to communication – since a simplistic view of membership is just who we can communicate with. It is highly desirable that
everyone in the cluster be able to communicate with everyone else.
It's the job of the membership layer to provide this information to
the cluster.
When your
communication fails in weird ways, it's the job of the membership
layer to present a view of the cluster that makes sense – in spite
of the weird kinds of failures that might be going on.
If we
eventually wind up with multiple kinds of communications methods,
then we'll also have multiple ways of becoming a member.
Linux-HA
(with or without OpenAIS) supports the AIS membership APIs.
Like I mentioned for communication, in an ideal world, all cluster components would use exactly the same membership information. However, it is important to note that the membership one uses must be computed using the communication method the application is actually using. So, unless every cluster-aware application uses both the common communication method and the common membership, it risks getting its membership out of sync with its communication, and with other components using other communication methods. In many cases, this can't be avoided. Methods for coping with such discrepancies are discussed in more detail at the end of this post.
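To illustrate why membership has to be derived from the same channel an application actually uses, here is a small sketch: membership, from a node's point of view, is just the set of nodes transitively reachable over that channel. The node names and reachability data are invented for the example.

from collections import deque

def compute_membership(local_node, can_talk_to):
    # Membership as seen from local_node: everyone transitively reachable
    # over *this* communication channel.  Illustrative only.
    members = {local_node}
    queue = deque([local_node])
    while queue:
        node = queue.popleft()
        for peer in can_talk_to.get(node, ()):
            if peer not in members:
                members.add(peer)
                queue.append(peer)
    return members

# Hypothetical reachability as observed over two different channels:
heartbeat_links = {"alpha": ["beta"], "beta": ["alpha", "gamma"], "gamma": ["beta"]}
app_links       = {"alpha": ["beta"], "beta": ["alpha"], "gamma": []}

print(compute_membership("alpha", heartbeat_links))   # {'alpha', 'beta', 'gamma'}
print(compute_membership("alpha", app_links))         # {'alpha', 'beta'} - already out of sync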
Fencing
Fencing is the ability to “disable” nodes not currently in our membership without their cooperation. Many of you will remember this being discussed in some detail in an earlier post[2]. As I explained in more detail there, fencing is vital to ensure safe cluster operation.
Our current implementation is STONITH[3]-based – STONITH == Shoot The Other Node In The Head.
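The safety rule that fencing enforces is easy to state: never take over a departed node's resources until the fence has been confirmed. A minimal sketch of that ordering follows; fence_node and start_resource are hypothetical stand-ins, not the actual STONITH plugin API.

def recover_node(node, resources, fence_node, start_resource):
    # Take over resources from a node that has left membership.
    # fence_node(node) -> bool and start_resource(res) are stand-ins for the
    # real STONITH and resource-start operations.
    if not fence_node(node):
        # If we can't confirm the node is really dead, it might still be
        # writing to shared storage - refuse to recover its resources.
        raise RuntimeError("fencing of %s failed; not recovering its resources" % node)
    for res in resources:
        start_resource(res)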
Quorum
Quorum
is a topic we've talked about extensively in a couple of earlier
posts [2]
[4].
Quorum encapsulates the ability to determine whether the cluster can continue to operate safely or not.
Quorum
is tied closely to both fencing and membership. In practice, as we
discussed before, it is often highly desirable to implement multiple
types of quorum. Linux-HA
currently provides multiple implementations and can provide more
through plugins. Like membership and communication, it is desirable
for all cluster components to use the same quorum mechanism. All the interesting (and legal) ways that quorum can interact with fencing, membership, and the communication layer are too detailed to cover in this posting.
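As one concrete example, the most common quorum computation is a simple majority of the configured votes. The sketch below is illustrative only; the node counts are invented.

def have_quorum(active_votes, total_votes):
    # Simple majority quorum: strictly more than half of all configured votes.
    return active_votes > total_votes // 2

# A hypothetical 5-node cluster that has split 3/2: only the 3-node side may proceed.
print(have_quorum(active_votes=3, total_votes=5))   # True
print(have_quorum(active_votes=2, total_votes=5))   # False

This simple rule is also exactly what works poorly for two-node clusters, which is part of why having multiple quorum implementations (and the quorum daemon described later) is so useful.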
Cluster Filesystems
Cluster
filesystems allow multiple machines to sanely mount the same
filesystem at the same time. This is a great boon to parallel
applications. Cluster filesystems typically don't involve the network or another server when doing bulk I/O.
Each node mounting the filesystem is normally expected to have
access to the data. This typically requires a SAN.
Typically, this is done to improve performance, but convenience and manageability are common secondary goals. Cluster filesystems are related to, but distinct from, network filesystems like NFS and CIFS.
On Linux, there are several cluster filesystems available, including:
Red Hat's GFS
Oracle's OCFS2
IBM's GPFS
etc.
Normally,
when they're being used for performance reasons, cluster-aware
applications are required. You can't typically just run 'n'
copies of your favorite cluster-ignorant application and have it
work. The filesystem won't scramble the data, but your application
typically will. It goes without saying that high-performance cluster
filesystems run in the kernel – unlike all the other items we've
talked about before. Because of the performance requirements, it's common for a cluster filesystem to have its own communication and membership code – not using the typical user-space communications and membership code. Since membership isn't high-bandwidth or particularly low-latency information, it is possible to feed membership from a user-space membership layer into the kernel. Of course, then the membership and the communications layer are out of sync. It is arguable whether this is an improvement or not.
Cluster
filesystems typically need cluster lock managers – described in the
next section.
DLM - Distributed Lock Manager
A
DLM (Distributed Lock Manager) provides locking services across the
cluster, and it's an interesting piece of code to implement – particularly the error recovery.
To some degree, DLMs are analogous to System V semaphores – but cluster-aware. In addition, they provide a much more sophisticated API and semantics. Although DLM APIs are fairly well understood, there is no formal standard, so switching from one to another can be annoying. Red Hat has a reasonable kernel-based DLM which they use with GFS. DLMs commonly have their own separate communications and membership code. The comments made for cluster filesystems about getting membership from user space, and having it potentially be out of sync, also apply here.
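To give a feel for what DLM semantics look like, here is a sketch of the classic lock-mode compatibility check used by VMS-style lock managers. The mode names and matrix are the traditional ones, but the code is illustrative and is not any particular DLM's API.

# Classic VMS-style lock modes and their (symmetric) compatibility matrix.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},   # null
    "CR": {"NL", "CR", "CW", "PR", "PW"},         # concurrent read
    "CW": {"NL", "CR", "CW"},                     # concurrent write
    "PR": {"NL", "CR", "PR"},                     # protected read
    "PW": {"NL", "CR"},                           # protected write
    "EX": {"NL"},                                 # exclusive
}

def can_grant(requested_mode, held_modes):
    # Grant a lock only if the requested mode is compatible with every mode
    # already held on the resource.
    return all(held in COMPATIBLE[requested_mode] for held in held_modes)

print(can_grant("PR", ["PR", "CR"]))   # True  - shared readers coexist
print(can_grant("EX", ["PR"]))         # False - must wait for the reader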
Cluster Volume Managers
You
might think that you really don't need a cluster-aware volume
manager. Sometimes you might be right. More often, if you thought
that, you'd be wrong. A cluster volume manager is just like a
regular volume manager – only cluster aware. This is to keep
different nodes from getting inconsistent views of the layout of a
set of disks or volumes. The current cluster-aware volume managers
are EVMS and CLVM. Only CLVM is expected to survive into the long
term.
The
big challenges for cluster volume managers are high-performance
mirroring and snapshots. These operations are potentially very
difficult to implement right and fast. Cluster-aware volume managers
often have both kernel and user-space components. The membership inconsistency issues here are similar to those for cluster filesystems and the DLM.
CRM - Cluster Resource Manager
Every
HA cluster has something like a CRM, but they may divide up these
functions differently. Our CRM is a policy-based decision maker for
what should run where – handling failed services and failed cluster
nodes.
The
CRM is similar to UNIX/Linux startup init scripts – it starts
everything up – but across a cluster following some policies, and
managing failures.
The
Linux-HA CRM is arguably the best cluster resource manager around
today – at least in terms of flexibility and power. It has usability issues, and there are places where it could be extended, but those are solvable problems.
The
Linux-HA CRM function is largely divided between the PE and TE –
which are described below.
PE - Policy Engine
The Policy Engine is a key component of the CRM and does two distinct things: it decides what the cluster layout should be (which resources should run where), and it computes the graph of actions needed to move the cluster from its current state to that layout.
This graph of actions is then given to the TE (described below).
The
system would have more flexibility if the PE were split into two parts
for these two functions, and supported plugins for the cluster layout
function.
It
currently isn't aware of resource cost, nor of absolute resource
limits and load balancing considerations, which complicate optimal
placement. Those would be good things to add to it in the future. Having plugins for doing resource placement would also be a highly useful and desirable thing.
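Here is a deliberately tiny sketch of the two PE functions just described: first decide placement, then derive the actions needed to get there. The first-allowed-node placement rule and the flat action list are simplifications of my own, not the real Policy Engine's algorithm (which produces a genuine graph that also encodes what may run in parallel).

def compute_placement(resources, online_nodes, allowed_nodes):
    # Step 1: decide where each resource should run.
    # Trivial rule for illustration: the first online node it is allowed on.
    placement = {}
    for res in resources:
        candidates = [n for n in online_nodes
                      if n in allowed_nodes.get(res, online_nodes)]
        placement[res] = candidates[0] if candidates else None
    return placement

def compute_actions(current, desired):
    # Step 2: turn current vs. desired placement into ordered actions.
    # A stop on the old node must precede the start on the new one.
    actions = []
    for res, new_node in desired.items():
        old_node = current.get(res)
        if old_node == new_node:
            continue
        if old_node is not None:
            actions.append(("stop", res, old_node))
        if new_node is not None:
            actions.append(("start", res, new_node))
    return actions

# Hypothetical example: node1 is being taken out of service, so the webserver must move.
current = {"webserver": "node1", "database": "node2"}
desired = compute_placement(["webserver", "database"], ["node2", "node3"],
                            {"webserver": ["node1", "node3"]})
print(compute_actions(current, desired))
# [('stop', 'webserver', 'node1'), ('start', 'webserver', 'node3')]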
TE - Transition Engine
The TE receives a graph of actions to perform from the Policy Engine, then uses the LRM proxy to communicate with the LRMs to carry out those actions. Its main jobs are action sequencing, error detection, and reporting.
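A correspondingly small sketch of the TE's side of the bargain: walk the actions in order, hand each to the LRM, and stop and report on the first failure. perform_action is a hypothetical stand-in for talking to the LRMs through the LRM proxy.

def run_transition(actions, perform_action):
    # perform_action(action) -> bool is a stand-in; the real TE dispatches
    # through the LRM proxy and handles much richer ordering and recovery.
    for action in actions:
        if not perform_action(action):
            # Error detection and reporting: abort this transition so the PE
            # can compute a new one from the updated cluster state.
            return ("failed", action)
    return ("complete", None)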
CIB - Cluster Information Base
The CIB
manages information on cluster configuration and current status. The
cluster configuration includes the configuration and policies as
defined by the system administrator.
Its key difficulty is keeping a consistent copy replicated across the cluster while resolving potential version differences.
All the data it manages is XML, and the CIB has only minimal knowledge of the structure of this XML.
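A hedged sketch of the version-resolution idea: each replica carries a version that is bumped on every update, and when copies disagree, the highest version wins. The (epoch, num_updates) tuple here is purely illustrative; the real CIB's versioning scheme is more involved.

def newest_copy(replicas):
    # Each replica is (version, xml_text); version is an illustrative
    # (epoch, num_updates) tuple that is bumped on every change.
    return max(replicas, key=lambda replica: replica[0])

replicas = [
    ((4, 17), "<cib .../>"),   # stale copy from a node that was briefly offline
    ((5, 2),  "<cib .../>"),   # current copy
]
print(newest_copy(replicas)[0])   # (5, 2)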
LRM - Local Resource Manager
In
the Linux-HA architecture, a local resource manager runs on every
machine and carries out the tasks given to it. Everything
that gets done gets carried out by the LRM. Examples are:
start this resource
stop this resource
monitor this resource
migrate this resource
etc.
The LRM
provides interface matching to the various kinds of resources through
Resource Agents. The Linux-HA LRM
supports several classes of Resource Agents.
The LRM
is not at all cluster-aware. It can support an arbitrary
number of clients, one of which is the LRM communications proxy
(below).
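A minimal sketch of what the LRM's job amounts to: map an operation request onto the matching resource agent and run it. The agent path and calling convention below are illustrative assumptions, not the actual Linux-HA LRM interface – real resource agent classes each have their own conventions.

import subprocess

def run_resource_operation(agent_path, operation, environment=None):
    # Run one operation (start / stop / monitor / ...) against a resource
    # agent, which is typically an executable taking the operation as an
    # argument.  Sketch only - the calling convention varies by agent class.
    result = subprocess.run([agent_path, operation], env=environment,
                            capture_output=True, text=True)
    return result.returncode, result.stdout

# Hypothetical usage: ask an agent whether its resource is healthy.
# rc, output = run_resource_operation("/usr/lib/resource-agents/my-service", "monitor")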
LRM Communications Proxy
The LRM
proxy communicates between the CRM and the LRMs on all the various
machines. This function is
currently built into the CRM. This architectural decision was based
on expedience more than anything else.
To
support larger clusters this needs to be separated out, made more
scalable, and more flexible. This would allow a large number of LRMs
to be supported by a small number of LRM proxies. In large systems,
this would probably use the
ClusterIP capability to provide load distribution (leveling) across
multiple LRM proxies.
Init - Initialization and recovery
This code really does three things:
Sequences the startup of the cluster components
Recovers from component failures (restart or reboot)
Sequences the shutdown of all the various cluster components
This
is currently provided by Linux-HA and bundled with the Linux-HA
communications code. This likely needs to be split out into a separate proxy function (process) in the future.
Infrastructure
The Linux-HA infrastructure libraries (“clplumbing”) do a wide variety of things. A few samples include:
Inter-Process Communication
Process management
Event management and scheduling
Many other miscellaneous functions
Surprisingly,
these libraries amount to about 20K lines of code.
Quorum Daemon
The
quorum daemon is an unusual daemon, because it's the only daemon we
have that's intended to run outside the cluster proper. It is
instrumental in solving certain
knotty quorum problems. This was discussed extensively in previous postings [4] and [5].
Management Daemon
The management daemon provides a complete, authenticated configuration and status API. This includes both information contained in the CIB, and also information about the communications configuration and so on.
The management daemon is used by management clients such as the GUI (described below).
Clients are authenticated using PAM, and all communication is via SSL, so its clients can safely be outside the cluster, or even outside a
firewall. This daemon should provide different levels of
authorization depending on the authenticated user, and should log its
actions in a format suitable for Sarbanes-Oxley (SOX) auditing
purposes.
GUI
The GUI provides a graphical interface for viewing configuration and status information. It also supports creating and configuring the cluster. Note that at the present time, there are a number of useful cluster configurations it cannot create.
CIM and SNMP agents
The CIM
and SNMP agents provide CIM and SNMP management interfaces for
systems management tools. The CIM
interface supports status updates and configuration changes, whereas
the SNMP interfaces only report status.
Disadvantages of this architecture
For a
variety of reasons, kernel space doesn't have access to user-space
cluster communications or membership.
As a result, both the DLM and most cluster filesystems implement their own membership and communications.
This is in contradiction to the “ideal world” statements earlier. It can result in odd cases where one communication method is working but another is not, producing differences in membership – which can have bad effects.
Why this might not be quite as bad as it seems
One reason not to worry about this too much is that it's a problem which one can't make go away. A cluster system will always have to interface with software packages which do their own communication and compute their own membership, for a variety of usually good reasons. Since we can't make the problem go away, we have to deal with it effectively. There are basically two cases to consider:
The
“Main” membership thinks that node X should not be in the
cluster, whereas the “Other” membership thinks it should be.
The
“Other” membership thinks that node X should not be in the
cluster, whereas the “Main” membership thinks it should be.
Let's
take these two cases one at a time:
Case 1:
If the main membership thinks node X is not in the
cluster, then it will simply not start any resources on node X. This
takes care of the problem.
Case 2: If the “Other” membership discovers that a particular node should be dropped from its view of membership, and it can inform the CRM not to start its resources on that machine, then – from the perspective of the resources it deals with – the membership is effectively made to exclude these Other-errant nodes. In the Linux-HA CRM this is easily done by having the Other-resources write node attributes, with placement rules written to exclude those nodes from consideration for running Other-related resources.
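A small sketch of the Case 2 mechanism in pseudo-CRM terms: the Other-aware resource publishes a node attribute, and placement simply filters out nodes where that attribute says the node is no longer a valid Other member. The attribute name and rule form are invented for illustration; in the real CRM this is expressed as rules in the CIB.

def eligible_nodes(online_nodes, node_attributes, attribute="other_member"):
    # Nodes the CRM may consider for Other-related resources: only those the
    # "Other" membership has marked as good.  The attribute name is hypothetical.
    return [node for node in online_nodes
            if node_attributes.get(node, {}).get(attribute) == "true"]

# Main membership sees all three nodes, but the Other membership has dropped
# 'gamma' and recorded that fact as a node attribute.
attrs = {"alpha": {"other_member": "true"},
         "beta":  {"other_member": "true"},
         "gamma": {"other_member": "false"}}
print(eligible_nodes(["alpha", "beta", "gamma"], attrs))   # ['alpha', 'beta']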
Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case – particularly when proprietary applications are involved. So, even if there is some membership discrepancy, it is always possible to manage it appropriately, assuming you can get a tiny bit of cooperation from the application.
References
[1] http://linux-ha.org/
[2] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[3] http://linux-ha.org/STONITH
[4] http://techthoughts.typepad.com/managing_computers/2007/10/more-about-quor.html
[5] http://techthoughts.typepad.com/managing_computers/2007/11/quorum-server-i.html