Something that people commonly do which involves a form of automation is load balancing. Load balancing is the idea that incoming network requests are distributed across a set of servers, each of which provides the same service. If you spread the load across "n" servers, then in an ideal world what you get is "n" times the throughput. And, since you have redundant servers, with the right kind of automation software, you can also get a degree of high availability. This is way cool! This article will talk about load balancing as a general technique, and specifically about ways to do it on Linux using free or open source software. In particular, we'll talk about the Linux Virtual Server project (LVS, ipvs) and ClusterIP addresses as load balancing techniques.
Meanwhile, back in the real world, we see some slight differences from this ideal view of things. We see that load balancers often introduce single points of failure, and that the load balancer or the back end servers typically introduce scalability limitations. To really understand these problems, we need to look at specific load balancing techniques in a little more detail. Please understand that I'm not an in-depth expert on any of these techniques, but I do have basic familiarity with the methods described here.
Linux Virtual Server
The first technique we'll cover is the Linux Virtual Server (LVS) - which is implemented by the ipvs kernel module. Much of what I have to say about LVS also applies to most load balancers - hardware or software - since they typically work roughly the same way as LVS.
I usually describe LVS clusters as being similar to a baseball diamond - with the load balancer on third base, web (or other) "real servers" stretched from home plate to second base, and the back end database on first base. In this image, requests flow from left to right, starting from the users in the dugout beyond the third base foul line, and responses flow from right to left, from the database or file server on first base back to the users in the dugout. [This imagery works great when talking to Americans or Japanese on the phone, but often fails for people from other cultures].
The first thing to notice is that the
only inherently scalable portion of this arrangement is the web
servers in the middle. The load balancer (on third base) and the
database server (on first base) are each potentially performance
bottlenecks and potentially single points of failure.
If you make each of them redundant to eliminate single points of failure, the picture looks much the same, just with standby directors and database servers added. There are many variations on this arrangement:

- Failover versus load sharing load balancers
- Different applications on the "real servers" instead of WAS / web servers
- Different routing techniques for the load balancer
- Different data sources instead of a DB2 database

In the end, however, they all look a lot alike, and work very similarly.
In a NAT
arrangement, both incoming and outgoing packets flow through the LVS
director. In a direct routing arrangement, only incoming packets
flow through the LVS director, and outgoing packets bypass the
director, and go directly to the clients.
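To make the NAT versus direct routing distinction concrete, here is a minimal sketch of configuring an LVS virtual service with ipvsadm. The addresses are placeholders, and these commands must run as root on the director:

```shell
# Define the virtual service on the VIP, using round-robin scheduling.
ipvsadm -A -t 192.0.2.100:80 -s rr

# NAT mode (-m): the director rewrites packets, and replies flow
# back through the director on their way to the client.
ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.11:80 -m
ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.12:80 -m

# Direct routing mode (-g) would look like this instead; replies
# bypass the director, so each real server must also be configured
# to accept traffic for the VIP without answering ARP for it:
# ipvsadm -a -t 192.0.2.100:80 -r 10.0.0.11:80 -g
```

The choice between -m and -g is exactly the NAT versus direct routing trade-off described above: NAT is simpler to set up but pushes all reply traffic through the director.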
Although you could set this all up by hand and start all the services by hand, if anything failed, you'd have to reconfigure things by hand. Since the theme of this blog is automation, obviously, the right answer is to automate this setup and reconfiguration on failure. A common way to do this is to use the Linux-HA software, which includes the LVS tool ldirectord. Ldirectord will look at your real servers and see whether they and the services they're running are operating correctly, and will take corrective action if it sees problems. The Linux-HA software will watch the directors (sitting on third base), and fail things over and back if problems come up, to eliminate the single point of failure on third base.

As of now, the most common configurations have the real servers be part of an LVS cluster, but not part of a Linux-HA cluster. For historical reasons, the load balancers (directors) on third base are in one cluster, and the database server(s) on first base are commonly in a separate cluster. However, with the release 2.x versions of Linux-HA, it is perfectly sensible to include both in the same cluster, perhaps in an n+1 sparing arrangement. If you have fewer than 10-12 real servers, then it might also make sense to let Linux-HA manage those real servers as well. The reason for the upper limit is to ensure that the total cluster isn't larger than the current Linux-HA limitation on cluster size (approximately 16 nodes).

Another possible configuration is to use Linux-HA to monitor your real servers. This would involve writing a clone resource agent for configuring LVS to point at the various real servers. This might result in a more scalable monitoring arrangement than the current ldirectord monitoring arrangement, since the monitoring is done on each real server, and only errors are reported back to Linux-HA.
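To give a flavor of what ldirectord's monitoring looks like, here is a sketch of an ldirectord.cf fragment. The addresses, request page, and expected response are placeholders - see the ldirectord documentation for the full syntax:

```
# Global checks: how often to probe, and when to give up.
checktimeout=10
checkinterval=5
autoreload=yes

# One virtual service (the VIP), with two NAT-mode real servers.
virtual=192.0.2.100:80
	real=10.0.0.11:80 masq
	real=10.0.0.12:80 masq
	service=http
	request="index.html"
	receive="Test Page"
	checktype=negotiate
	scheduler=rr
```

With checktype=negotiate, ldirectord actually fetches the request page from each real server and removes a server from the LVS table if the expected response doesn't come back - this is the "corrective action" mentioned above.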
This is a very brief overview of LVS,
which perhaps we can expand on in a future posting. For a thorough
treatment of LVS, I recommend The Linux Enterprise Cluster
by Karl Kopper.
Clearly every inbound packet has to go through the load balancer (director) - so it has to receive, look at, and forward each inbound packet. It may also have to rewrite headers and recompute checksums on each packet. If it is configured with NAT, then it also has to read and rewrite all outbound packets as well. In addition, with ldirectord and similar software, the director also has the job of monitoring all the real server processes on all the real servers. Eventually, this node (or these nodes) will become a bottleneck. When this happens depends on the nature of the workload, the complexity of monitoring, and the director configuration chosen.
Although LVS doesn't require a master's degree to configure, some features of it do have a reasonably steep learning curve. For a very easy-to-configure, albeit less scalable load distribution method on Linux, you might consider using ClusterIP addresses.
What is a Cluster IP?
The unique feature of a cluster IP is that it has no load balancer, hence no single point of failure. Wow! That seems weird! What does the picture look like? If you move the users out of the dugout onto third base, you'll get the basic idea. But that picture brings lots of questions to mind - like how do packets get routed?
The answer is simple - each machine in
the cluster has the same IP address. Say what? The same IP address?
Yes. I mean the same IP address. How can this work? It sounds
like it flies in the face of the usual teaching about networking.
Which, in a way, it does.
Enter the Multicast MAC
The trick to making this work is to have each machine present an ARP table entry with the same MAC address in it - a multicast MAC address. So when an ARP request arrives, all nodes in the cluster respond, but they all give the same answer: "I have IP address XXX with MAC address YYY". So, in effect, there is no confusion - because it doesn't matter which ARP reply is listened to; they all say the same thing. Therefore, at the IP level, everyone is happy.
So far, this is a reasonably satisfying answer, but not quite complete. What about addressing at the MAC level, and at the TCP or UDP level?
At the MAC level, switches recognize multicast MAC addresses and send frames addressed to them out all switch ports, since every server has presented that MAC address as "theirs". So the switch copies all the packets to all the servers.
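On Linux, the multicast MAC and the per-node setup are configured together with the iptables CLUSTERIP target. Here is a hedged sketch - the address, interface, and MAC are placeholders, and each node runs the same command with a different --local-node value:

```shell
# Accept traffic for the shared cluster IP on this node.
# --clustermac must be a multicast MAC address (low bit of the
# first octet set), which is what makes the switch flood the
# frames to every node.
iptables -I INPUT -d 192.0.2.100 -i eth0 \
    -j CLUSTERIP --new \
    --hashmode sourceip-sourceport \
    --clustermac 01:00:5e:00:00:20 \
    --total-nodes 3 --local-node 1
```

The --hashmode option chooses which packet header fields feed the hash that decides packet ownership, which is the mechanism described in the next section.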
What happens at the TCP
or UDP level?
This is where things get a little more interesting. Now it's obvious how each machine gets the packets - every machine gets them. But now what? We clearly don't want every machine to respond to a given TCP packet; that would totally confuse everything, as would handing every packet to all the applications. To solve this problem, Linux added a hashing feature which feeds the source address and the source and destination port numbers into a hash function to decide which machine will respond to any given request. So, if you have three hash buckets and three servers, the packet header information (source IP and port numbers) is hashed into three buckets, with one bucket assigned to each server. If a packet hashes to the bucket assigned to this server, it is kept and passed along to the UDP or TCP layers. If it doesn't, it's just dropped (ignored).
So, this hashing method determines
which host serves each request. Although the ethernet driver in
every machine sees each packet, each packet is processed
by only one machine. Now you know how it works.
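The bucket-selection idea can be sketched in a few lines of Python. This is purely illustrative - the kernel uses its own hash function, and the function names and bucket assignment here are made up:

```python
# Illustrative sketch of ClusterIP-style bucket selection.
# NOT the kernel's actual hash - just the same idea: every node
# computes the same deterministic hash over packet header fields,
# and keeps the packet only if it owns the resulting bucket.
import hashlib

def bucket_for(src_ip: str, src_port: int, dst_port: int, n_buckets: int) -> int:
    """Hash the packet header fields into one of n_buckets buckets."""
    key = f"{src_ip}:{src_port}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets

# Assumption for illustration: this node owns bucket 0 of 3.
MY_BUCKETS = {0}

def should_accept(src_ip: str, src_port: int, dst_port: int) -> bool:
    """Keep the packet only if it hashes to a bucket this node owns."""
    return bucket_for(src_ip, src_port, dst_port, 3) in MY_BUCKETS
```

Since every node runs the same function over the same header fields, exactly one node keeps any given packet, even though the nodes never talk to each other about it.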
It also turns out to be very easy to
configure using Linux-HA, as you can see on our ClusterIP web
page. In the
process, Linux-HA also handles all the redundancy and failover of
cluster IP buckets for you automatically. Very cool indeed.
If you only configure one bucket per
node, then when a node fails, all of its traffic has to get assigned
to one machine. If you start out with 3 nodes in your ClusterIP
group, and one node dies, then that means that one node gets all the
additional traffic - effectively doubling its workload. So a better
idea, for "n" nodes, is to have n*(n-1)
cluster IP buckets. That way, when any given machine fails, its
workload is split evenly across the remaining nodes. In Linux-HA
terminology, the ClusterIP address is called a clone resource, and
what you want is to configure clone_max to n*(n-1) and
clone_node_max also to n*(n-1). Although
clone_node_max probably doesn't have to be this large, it
would allow a single node to handle all the traffic, if a sufficient
number of ClusterIP peers die.
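The n*(n-1) arithmetic is easy to check with a quick sketch (the function name is made up, and this assumes the failed node's buckets are handed out one each to the survivors):

```python
def rebalance(n: int) -> tuple[int, int]:
    """Buckets per node before and after one node fails,
    given n*(n-1) buckets spread evenly over n nodes."""
    buckets = n * (n - 1)
    before = buckets // n            # each node starts with n-1 buckets
    # The failed node's n-1 buckets split one each among n-1 survivors:
    after = before + (n - 1) // (n - 1)
    return before, after

# With 3 nodes: 6 buckets, 2 per node; after a failure each survivor
# carries 3 - a 50% increase, instead of the 100% increase you get
# with one bucket per node.
```

The same pattern holds for any n: each survivor picks up exactly one extra bucket, so the extra load is spread as evenly as possible.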
Every node in the cluster will see all incoming IP packets. As I understand it, many/all switches will also send every packet to every switch port in the subnet (or vlan). This argues for a small subnet for this function. But, the packets are discarded at a very early stage - minimizing the overhead on the host. Outbound packets are not affected by this arrangement. This kind of arrangement works well for these kinds of cases:
long processing time per packet (complex J2EE applications, for example)
small incoming packets with large outgoing packets
smaller number of processing nodes
It probably works less well with the opposite kinds of configurations:
high number of incoming packets with trivial processing per packet
large incoming packets (uploading DVD images, for example)
large number of processing nodes
Note that in this case, since there is
no head-end processor like an LVS director that can be a single point
of failure, no special provisions are needed for high availability
when used with Linux-HA. It is typically not as scalable as an LVS load
balancer, but it is trivial to set up and use.