
December 2007

December 13, 2007

How Managed Virtualization (including HA) conflicts with System Management

Managed Virtualization Versus System Management

In an earlier post[1], I talked about a couple of kinds of virtualization, comparing two of them and highlighting their strengths.  This posting discusses how virtualization can confuse and confound conventional systems management - both automated and manual, and gives some thoughts on how to deal with it.

We all know that virtualization is a GoodThing(TM).  Therefore, it can't really have any disadvantages, can it?  <tongue-in-cheek-off> Unfortunately, it does have disadvantages.  The great strength of virtualization is its ability to break the ties between a service or operating system and the server which implements its service.  Many software systems and a good number of human beings find this confusing.  If I want to reboot a physical server, what services or operating systems will be disrupted by the reboot?

Conversely, if I want to do something to the machine that's running a particular service, which machine do I have to log into?  If you're running service virtualization (conventional HA like Linux-HA[2]) on top of server virtualization (à la Xen or VMware), then you have a doubly difficult task - first you have to figure out which virtual machine is running a service, then you have to figure out which physical machine is running that particular virtual machine.
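To make the two levels of indirection concrete, here is a minimal sketch in Python.  The service, VM and host names are all made up for illustration, and a real data center would have to query the HA and virtualization managers for these mappings rather than hard-code them, since they change whenever something moves.

```python
# Hypothetical inventory: which VM currently hosts each service, and which
# physical machine currently hosts each VM. In practice both maps change
# whenever the HA or virtualization manager moves something.
service_to_vm = {"webmail": "vm-a", "database": "vm-b", "dns": "vm-a"}
vm_to_host = {"vm-a": "phys-1", "vm-b": "phys-2"}


def host_for_service(service):
    """Follow both levels of indirection: service -> VM -> physical host."""
    return vm_to_host[service_to_vm[service]]


def services_on_host(host):
    """Answer the reverse question: what is disrupted if this host reboots?"""
    vms = {vm for vm, h in vm_to_host.items() if h == host}
    return sorted(s for s, vm in service_to_vm.items() if vm in vms)


print(host_for_service("database"))
print(services_on_host("phys-1"))
```

The first function answers "which machine do I log into?", and the second answers "what breaks if I reboot this box?".  Both answers are only as good as the freshness of the two maps, which is exactly the problem.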

This can be really annoying and can easily result in system administrators[3] making mistakes either in the middle of the night, or when under pressure (which all sysadmins know is pretty much all the time).

Remember - Complexity is the Enemy of Reliability.   This is just another example of my favorite phrase at work.

And, if you want to have server monitoring software which tries to figure out whether a service is stopped and then restarts it, then it can also get confused by the fact that all these stupid servers and services are always moving around.  They just won't stay put!  Back in the olden days, you logged into a server and you edited the inittab, and you always knew what hardware it was running on and what server it was.  Now, with virtualization, and especially with virtualization management software, you never know what's where.

A Recipe for Chaos and Conflict

Your HA software and/or your virtualization management software can move things around on you.  Imagine that you have these four kinds of things in your data center:

  • High-Availability (HA/service-virtualization) management software

  • Virtualization management software

  • System management monitoring software

  • Human system administrators

This is a recipe for chaos, interspersed with the occasional career-limiting disaster. It's this kind of thing that leads system administrators to pull their hair out, and keep their resumes up to date.  None of these is bad by itself; in fact, each is a GoodThing(TM).  But they don't normally play well with each other. In typical myopic software design fashion, each of these layers is usually unaware of the other layers (except, of course, for the last (human) layer - who has to make up for all the poor integration).

In addition, since the software layers typically aren't aware of all this wonderful virtualization going on, they can't really deal with the picture reliably.  They don't know what should be happening where, because it isn't fixed.  The various virtualization management packages keep changing things!

So, what's a body to do?  As far as I know, there are two basic options.

  1. Integrate the four layers of management with each other using things like CIM[4] and SNMP[5]

  2. Empower your HA software to also manage the server virtualization of your data center

Integration of Layers

Virtually every data center (sadly, pun intended) has a variety of server types and a variety of operating systems, and a variety of management software.  They mostly don't play well with each other.  Almost the only way to get them to play together - even if imperfectly - is to have them talk together using industry standard protocols.

Today, that means using SNMP or CIM.   Here is my personal view on the characteristics of these two protocols for your consideration.

  • SNMP - widely deployed - implemented in a truly compatible way, but far too weak for a job this hard.  SNMP is great for grabbing statistics, checking whether a server or router is up and what kind of load it is seeing in great detail.  Anything much beyond this, and the MIBs become 100% vendor-specific - meaning that cross-vendor integration breaks down - basically completely.  For HA clustering or virtualization management or worse yet the combination of the two - forget it.

  • CIM - widely deployed in expensive disk subsystems - but rarely deployed outside that.  It has newly developed models for virtualization and clustering, but like most standards they're mostly lowest-common-denominator standards, and unfortunately not widely deployed.  For example, Linux-HA[2] implements CIM, but unfortunately Linux-HA has tremendous power and capability which CIM can't begin to model.  So, this winds up being only possible to model using vendor-specific extensions - greatly weakening the possible integrations.

Now, I'm not saying that these two protocols are useless - far from it. Without open standards like CIM and SNMP, the prospect truly is hopeless.   But I am saying that integrating them in the typical-for-the-industry highly-heterogeneous data center is a challenge, and the more layers there are to integrate, the bigger the challenge.  Since standards necessarily trail industry practice, the more "bleeding edge" the topic (e.g., HA clustering or virtualization) and the more powerful the underlying tool (like Linux-HA), the greater the mismatch.

Here we have two bleeding edge topics and four layers.  Yikes!  Surely there must be some kind of alternative to this somewhat-unattractive mess.

Decrease The Layers and Let Them Manage Themselves

As I mentioned in my earlier virtualization posting, some HA packages (like Linux-HA) can also manage virtualization simultaneously.  So, one way of dealing with this is to let (or extend) your service virtualization product also manage your server virtualization.  One advantage of this approach is that service virtualization software (HA software) is comparatively mature technology, minimizing the risk.

Unfortunately, this doesn't yet go all the way in solving the problem either.  There are a few things that should change to make this really work well. These include

  • Support much larger HA clusters - hundreds to thousands of nodes.  In an ideal world, you'd really like as few of these HA/virtualization clusters as you can get.  Today you'd typically have to have one of these clusters for every 8-32 physical servers - which makes an awful lot of these things to manage in a data center containing hundreds or thousands of servers.

  • Integrate with many virtualization layers - Such a product would need to integrate with Xen, IBM System Z, IBM System P, Linux KVM, VMware, and future virtualization layers like the one promised by Microsoft.   This isn't rocket science, but by the time you're done, it will be some work.

  • Support monitoring and controlling services inside the virtual machine - Otherwise you haven't really integrated the two layers - and you wind up running some HA software inside some of the virtual machines.  Again, this isn't rocket science, but it will require some work[1] for each operating system you want to manage services for.

  • Integrate with provisioning systems - so that you can add and delete virtual machines and allocate disk to them and their applications with fewer possibilities for error, and more automation.

None of these items are technically difficult, and none of them are prohibitively expensive to implement.  Given that I'm the project leader for Linux-HA, and Linux-HA is one of the most capable HA products around, you might imagine that some of these thoughts are on my mind for our future  ;-).  Of course, that doesn't eliminate the necessity for integration with the remaining layers above, which is why Linux-HA implements both CIM and SNMP.  This allows the virtualization management infrastructure to actively and autonomically manage  servers and services, while letting it bubble up events (especially those it can't automatically recover from) to the management consoles and humans via protocols like SNMP and/or CIM.

Conclusions

Virtualization technologies add complexity to the data center along with the benefits they bring, and in the process may render the existing management facilities less than useful.  However, if HA and virtualization management are performed by a single entity, and open standards like CIM and SNMP are used, systems can be actively managed and the problems can be minimized.

See Also

Preparing for Virtual Management http://www.itbusinessedge.com/blogs/dcc/?p=276

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/virtualization-.html
[2] http://linux-ha.org/
[3] http://linux-ha.org/SysAdmin
[4] http://www.dmtf.org/standards/cim/
[4] http://en.wikipedia.org/wiki/Common_Information_Model_%28computing%29
[5] http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

December 04, 2007

A brief overview of load balancing techniques

Something that people commonly do which involves a form of automation is load balancing.  Load balancing is the idea that incoming network requests are distributed across a set of servers which then each provide the same service.  If you spread the load across "n" servers, then in an ideal world what you get is "n" times the throughput.  And, since you have redundant servers, with the right kind of automation software, you can also get a degree of high-availability.   This is way cool!  This article will talk about load balancing as a general technique, and specifically about ways to do it on Linux using free or open source software.  In particular we'll talk about the Linux Virtual Server project[1], (LVS, ipvs) and the Cluster IP[2] as load balancing techniques.

Meanwhile, back in the real world, we see some slight differences from this ideal view of things.  We see that load balancers often introduce single points of failure, and that the load balancer or some kind of back end servers typically introduce scalability limitations.  To really understand these problems, we need to look at specific load balancing techniques in a little more detail.  Please understand that I'm not an in-depth expert on any of these techniques, but I do have basic familiarity with the methods described here.

Linux Virtual Server
The first technique we'll cover is the Linux Virtual Server[1] (LVS) - which is implemented by the ipvs kernel module.  Much of what I have to say about LVS also applies to most load balancers - hardware or software, since they typically work roughly the same way as LVS.

I usually describe LVS clusters as being similar to a baseball[3] diamond - with the load balancer on third base, web (or other) "real servers" stretched from home plate to second base, and the back end database on first base.  In this image, requests flow from left to right starting from the users in the dugout to the left of the third base foul line, and responses flow from right to left from the database or file server on first base back to the users in the dugout.  [This imagery works great when talking to Americans or Japanese on the phone, but often fails for people from other cultures].

The first thing to notice is that the only inherently scalable portion of this arrangement is the web servers in the middle.  The load balancer (on third base) and the database server (on first base) are each potentially performance bottlenecks and potentially single points of failure.

If you make each of them redundant to eliminate single points of failure, the picture looks something like this:

[Figure: Diamondha640]

There are a number of variations on this basic theme:

  • Failover vs load sharing load balancers

  • Different applications on the "real servers" instead of WAS / Web servers.

  • Different routing techniques for the load balancer

  • Different data sources instead of a DB2 database

In the end, however, they look a lot the same, and work very similarly.

In a NAT[5] arrangement, both incoming and outgoing packets flow through the LVS director.  In a direct routing arrangement, only incoming packets flow through the LVS director, and outgoing packets bypass the director, and go directly to the clients.

LVS monitoring
Although you could set this all up by hand and start all the services by hand, if anything failed, then you'd have to reconfigure things by hand.  Since the theme of this blog is automation, obviously, the right answer is to automate this setup and reconfiguration on failure.   A common way to do this is to use the Linux-HA software[6], which includes the LVS tool ldirectord[4]. Ldirectord will look at your real servers and see if they and the services they're running are operating correctly.  It will then take corrective action if it sees problems.  The Linux-HA software will watch the directors (sitting on third base), and fail things over and back if problems come up, to eliminate the single point of failure on third base.

As of now, the most common configurations of real servers have them be part of an LVS cluster, but not part of a Linux-HA cluster.  For historical reasons, the load balancers (directors) on third base are in one cluster, and the database server(s) on first base are commonly in separate clusters.  However, with release 2.x versions of Linux-HA it is perfectly sensible to include both in the same cluster, perhaps in an n+1 sparing arrangement.  If you have fewer than 10-12 real servers, then it might also make sense to let Linux-HA manage those real servers as well.  The reason for the upper limit is to ensure that the total cluster isn't larger than the current Linux-HA limitations on cluster size (approximately 16 nodes).

Another possible configuration is to use Linux-HA to monitor your real servers.  This would involve writing a clone resource agent for configuring LVS to point at the various real servers.  This might result in a more scalable monitoring arrangement than the current ldirectord monitoring arrangement, since the monitoring is done on each real server, and only errors are reported back to Linux-HA.
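To give a feel for the kind of checking a monitor on the director does, here is a minimal Python sketch.  It is not ldirectord's actual implementation - the function names, the addresses, and the plain TCP connect test are all assumptions made for illustration.  The idea is simply: probe each real server, and only keep the ones that answer in the pool.

```python
import socket


def tcp_check(address, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False


def healthy_servers(real_servers, check=tcp_check):
    """Return the subset of real servers that pass the health check."""
    return [server for server in real_servers if check(server)]


# Illustration with a fake check, so no network is needed: pretend the
# second server has stopped answering.
down = {("10.0.0.12", 80)}
pool = [("10.0.0.11", 80), ("10.0.0.12", 80), ("10.0.0.13", 80)]
active = healthy_servers(pool, check=lambda server: server not in down)
print(active)
```

A real monitor would run this periodically and then reconfigure the load balancer to match the surviving list.  Note that all the probing happens from one place, which is why the cost grows with the number of real servers.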

This is a very brief overview of LVS, which perhaps we can expand on in a future posting.  For a thorough treatment of LVS,  I recommend The Linux Enterprise Cluster[7] by Karl Kopper.

Performance characteristics
Clearly every inbound packet has to go through the load balancer (director) - so it has to receive, look at, and forward each inbound packet.  It may also have to rewrite headers and recompute checksums on each packet.  If it is configured with NAT, then it also has to read and rewrite all outbound packets as well. In addition, with ldirectord and similar software, the director also has the job of monitoring all the real server processes on all the real servers.  Eventually, this node (or these nodes) will become a bottleneck.  When this happens depends on the nature of the workload, the complexity of monitoring, and the director configuration chosen.
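As a rough back-of-the-envelope illustration of why the director configuration matters, the Python sketch below counts how many packets the director has to touch under NAT versus direct routing.  The function name and the packet counts are hypothetical, and it ignores monitoring traffic and per-packet cost differences entirely.

```python
def director_packets(inbound, outbound, mode):
    """Packets the director must handle for a given forwarding mode.

    mode is "nat" (both directions pass through the director) or
    "dr" (direct routing: replies bypass the director).
    """
    if mode == "nat":
        return inbound + outbound
    if mode == "dr":
        return inbound
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical workload: small requests, large responses.
inbound, outbound = 1_000, 9_000
print(director_packets(inbound, outbound, "nat"))
print(director_packets(inbound, outbound, "dr"))
```

For a workload like this one, where responses dominate, direct routing takes most of the packet handling off the director.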

Cluster IP
Although LVS doesn't require a master's degree to configure, some features of it do have a reasonably steep learning curve.  For a very easy-to-configure, albeit less scalable load distribution method on Linux, you might consider using ClusterIP addresses[2].

What is a Cluster IP?
The unique feature of a cluster IP is that it has no load balancer, hence no single point of failure.  Wow! That seems weird!  What does the picture look like?  If you move the users out of the dugout onto third base, you'll get the basic idea.  But that picture brings lots of questions to mind - like how do packets get routed?

The answer is simple - each machine in the cluster has the same IP address.  Say what?  The same IP address? Yes.  I mean the same IP address.   How can this work?  This sounds like it flies in the face of usual teaching about networking.  Which it does.

Enter the Multicast MAC address
The trick to making this work is to have each machine have an ARP table entry with the same MAC address in it - a multicast MAC address.  So when an ARP request is made, all nodes in the cluster respond, but they all give the same answer: "I have IP address XXX with MAC address YYY".  So, in effect, there is no confusion - because it doesn't matter which ARP reply is listened to, they all say the same thing.  Therefore at the IP level everyone is happy.

So far, this is a reasonably satisfying answer, but not quite complete.  What about addressing at the MAC level, and at the TCP or UDP level?

At the MAC level, switches recognize multicast MAC addresses and forward packets addressed to them to all the switch ports, since every server has presented that MAC address as "theirs". So, the switch copies all the packets to all the servers.

What happens at the TCP or UDP level?
This is where things get a little more interesting.  Now, it's more obvious how each machine gets the packets - because every machine gets them.  But, now what?  We clearly don't want every machine to respond to a given TCP packet. That would totally confuse everything, as would giving every packet to all the applications.  To solve this problem, Linux has added a hashing feature which feeds the source address and the source and destination port numbers into a hashing function to decide which machine will respond to any given request.  So, if you have three hash buckets and three servers, the packet header information (source IP and port numbers) can be hashed into three buckets with one bucket assigned to each server.   If the packet hashes to the hash bucket assigned to this server, then it is kept, and passed along to the UDP or TCP layers.  If it doesn't hash to the bucket assigned to this server, then it's just dropped (ignored).

So, this hashing method determines which host serves the requests.  Although the ethernet driver in every machine sees each packet, each packet is only processed by one machine.  Now you know how it works.
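Here is a small Python sketch of the keep-or-drop decision each node makes.  It is only an illustration of the idea: the function names are made up, and zlib.crc32 is a stand-in for whatever hash function the kernel really uses.

```python
import zlib


def bucket_for(src_ip, src_port, dst_port, total_buckets):
    """Map a connection's header fields to a bucket number in [0, total_buckets)."""
    key = f"{src_ip}:{src_port}:{dst_port}".encode()
    # zlib.crc32 is a stand-in hash chosen for illustration only.
    return zlib.crc32(key) % total_buckets


def accepts(my_buckets, src_ip, src_port, dst_port, total_buckets):
    """Return True if this node should keep the packet, False to drop it."""
    return bucket_for(src_ip, src_port, dst_port, total_buckets) in my_buckets


# Three nodes, one bucket each. Every node sees the packet; exactly one keeps it.
nodes = {"node-a": {0}, "node-b": {1}, "node-c": {2}}
packet = ("192.0.2.7", 51234, 80)
keepers = [name for name, buckets in nodes.items() if accepts(buckets, *packet, 3)]
print(keepers)
```

Because every node computes the same hash from the same header fields, they all agree on who owns a given connection without ever talking to each other.  As long as every bucket is assigned to exactly one node, exactly one node keeps each packet.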

It also turns out to be very easy to configure using Linux-HA, as you can see on our ClusterIP web page[8].  In the process, Linux-HA also handles all the redundancy and failover of cluster IP buckets for you automatically.  Very cool indeed.

If you only configure one bucket per node, then when a node fails, all of its traffic has to get assigned to one machine.  If you start out with 3 nodes in your ClusterIP group, and one node dies, then that means that one node gets all the additional traffic - effectively doubling its workload.  So, a better idea for "n" nodes is to have n*(n-1) cluster IP buckets.  That way when any given machine fails, its workload is split evenly across the remaining nodes.  In Linux-HA terminology, the ClusterIP address is called a clone resource, and what you want is to configure clone_max to n*(n-1) and clone_node_max also to n*(n-1).  Although clone_node_max probably doesn't have to be this large, it would allow a single node to handle all the traffic, if a sufficient number of ClusterIP peers die.
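The arithmetic behind n*(n-1) is easy to check with a few lines of Python.  This sketch is not how Linux-HA places clone instances - the round-robin hand-off and the function names are assumptions for illustration.  It just shows that with (n-1) buckets per node, a failed node's buckets can be dealt out one apiece to the survivors.

```python
from collections import Counter


def assign_buckets(nodes):
    """Give each of n nodes (n - 1) buckets, for n * (n - 1) buckets in total."""
    n = len(nodes)
    return {bucket: nodes[bucket // (n - 1)] for bucket in range(n * (n - 1))}


def fail_node(assignment, failed):
    """Hand the failed node's buckets round-robin to the surviving nodes."""
    survivors = sorted(set(assignment.values()) - {failed})
    orphaned = [b for b, owner in sorted(assignment.items()) if owner == failed]
    updated = dict(assignment)
    for i, bucket in enumerate(orphaned):
        updated[bucket] = survivors[i % len(survivors)]
    return updated


nodes = ["node-a", "node-b", "node-c"]
before = assign_buckets(nodes)
after = fail_node(before, "node-b")
print(Counter(before.values()))
print(Counter(after.values()))
```

With three nodes there are six buckets, two per node.  When node-b fails, each survivor picks up one extra bucket and ends up with three, rather than one survivor taking the whole extra load.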

Performance characteristics
Every node in the cluster will see all incoming IP packets.  As I understand it, many/all switches will also send every packet to every switch port in the subnet (or vlan).  This argues for a small subnet for this function.  But, the packets are discarded at a very early stage - minimizing the overhead on the host.  Outbound packets are not affected by this arrangement.  This kind of arrangement works well for these kinds of cases:

  • long processing time per packet (complex J2EE applications, for example)

  • small incoming packets with large outgoing packets

  • smaller number of processing nodes

It probably works less well with the opposite kinds of configurations:

  • high number of incoming packets with trivial processing per packet

  • large incoming packets (uploading DVD images, for example)

  • large number of processing nodes

Note that in this case, since there is no head-end processor like an LVS director that can be a single point of failure, no special provisions are needed for high-availability when used with Linux-HA.  It is typically not as scalable as an LVS load balancer, but it is trivial to set up and use.

[1] http://www.linuxvirtualserver.org/
[2] http://flaviostechnotalk.com/wordpress/index.php/2005/06/12/loadbalancer-less-clusters-on-linux/
[3] http://en.wikipedia.org/wiki/Baseball
[4] http://www.vergenet.net/linux/ldirectord/
[5] http://en.wikipedia.org/wiki/Network_address_translation
[6] http://linux-ha.org/
[7] http://www.nostarch.com/frameset.php?startat=cluster
[8] http://www.linux-ha.org/ClusterIP