Web/Tech

December 04, 2007

A brief overview of load balancing techniques

Something that people commonly do which involves a form of automation is load balancing.  Load balancing is the idea that incoming network requests are distributed across a set of servers which then each provide the same service.  If you spread the load across "n" servers, then in an ideal world what you get is "n" times the throughput.  And, since you have redundant servers, with the right kind of automation software, you can also get a degree of high-availability.   This is way cool!  This article will talk about load balancing as a general technique, and specifically about ways to do it on Linux using free or open source software.  In particular we'll talk about the Linux Virtual Server project[1], (LVS, ipvs) and the Cluster IP[2] as load balancing techniques.

Meanwhile back in the real world, we see some slight differences from this ideal view of things.  We see that load balancers often introduce single points of failure, and that that the load balancer or some kind of back end servers typically introduce scalability limitations.  To really understand these  problems, we need to look at specific load balancing techniques in a little more detail.  Please understand that I'm not an in-depth expert on any of these techniques, but I do have basic familiarity with the methods described here.

Linux Virtual Server
The first technique we'll cover is the Linux Virtual Server[1] (LVS) - which is implemented by the ipvs kernel module.  Much of what I have to say about LVS also applies to the most load balancers - hardware or software, since they typically work roughly the same way as LVS.

I usually describe  LVS clusters as being similar to a baseball[3] diamond - with the load balancer on third base, web (or other) "real servers" stretched from home plate to second base, and the back end database on first base.  In this image, requests flow from the left to right starting from the users in the dugout to the left of second base foul line,  and responses flow from right to left from the database or file server on first base back to the users in the dugout.  [This imagery works great when talking to Americans or Japanese on the phone, but often fails for people from other cultures].

The first thing to notice is that the only inherently scalable portion of this arrangement is the web servers in the middle.  The load balancer (on third base) and the database server (on first base) are each potentially performance bottlenecks and potentially single points of failure.

If you make each of them redundant to eliminate single points of failure, the picture looks something like this:

Diamondha640
There are a number of variations on this basic theme:

  • Failover vs load sharing load balancers

  • Different applications on the "real servers" instead of WAS / Web servers.

  • Different routing techniques for the load balancer

  • Different data sources instead of a DB2 database

In the end, however, they look a lot the same, and work very similarly.

In a NAT[5] arrangement, both incoming and outgoing packets flow through the LVS director.  In a direct routing arrangement, only incoming packets flow through the LVS director, and outgoing packets bypass the director, and go directly to the clients.

LVS monitoring
Although you could set this all up by hand and start all the services by hand, if anything failed, then you'd have to reconfigure things by hand.  Since the theme of this blog is automation, obviously, the right answer is to automate this setup and reconfiguration on failure.   A common way to do this is to use the Linux-HA software[6], which includes the LVS tool ldirectord[4]. Ldirectord will look at your real servers and see if they and the services they're running are operating correctly.  It will then take corrective action if it sees problems.  The Linux-HA software will watch the directors (sitting on third base), and fail things over and back if problems come up, to eliminate the single point of failure on third base.  As of now, the most common configurations of real servers have them be part of an LVS cluster, but not part of a Linux-HA cluster.  For historical reasons, the load balancers (directors) on third base are in one cluster, and the database server(s) on first base are commonly in separate clusters.  However, with release 2.x versions of Linux-HA it is perfectly sensible to include the both in the same cluster, perhaps in an n+1 sparing arrangement.  If you have fewer than 10-12 real servers, then it might also make sense to let Linux-HA manage those real servers as well.  The reason for the upper limit is to ensure that the total cluster isn't larger than the current Linux-HA limitations on cluster size (approximately 16 nodes).  Another  possible configuration is to use Linux-HA to monitor your real servers.  This would involve writing a clone resource agent for configuring LVS to point at the various real servers.  This might result in a more scalable monitoring arrangement than the current ldirectord monitoring arrangement, since the monitoring is done on each real server, and only errors are reported back to Linux-HA. 

This is a very brief overview of LVS, which perhaps we can expand on in a future posting.  For a thorough treatment of LVS,  I recommend The Linux Enterprise Cluster[7] by Karl Kopper.

Performance characteristics
Clearly every inbound packet has to go through the load balancer (director) - so it has to receive, look at, and forward each inbound packet.  It may also have to rewrite headers and recompute checksums on each packet.  If it configured with NAT, then it also has to read and rewrite all outbound packets as well. In addition, with ldirectord and similar software, the director also has the job of monitoring the all the real server processes on all the real servers.  Eventually, this node (or these nodes) will become a bottleneck.  When this happens depends on the nature of the workload, the complexity of monitoring, and the director configuration chosen.

Cluster IP
Although LVS doesn't require a master's degree to configure, some features of it do have a reasonably steep learning curve.  For a very easy-to-configure, albeit less scalable load distribution method on Linux, you might consider using ClusterIP addresses[2].

What is a Cluster IP?
The unique feature of a cluster IP is that it has no load balancer, hence no single point of failure.  Wow! That seems weird!  What does the picture look like?  If you move the users out of the dugout onto third base, you'll get the basic idea.  But that picture brings lots of questions to mind - like how do packets get routed?

The answer is simple - each machine in the cluster has the same IP address.  Say what?  The same IP address? Yes.  I mean the same IP address.   How can this work?  This sounds like it flies in the face of usual teaching about networking.  Which it does.

Enter the Multicast MAC address
The trick to making this work is to have each machine have an ARP table entry with the same MAC address in it - a multicast MAC address.  So when an ARP request is given, all nodes in the cluster respond, but they all give the same answer"I have IP address XXX with MAC address YYY".  So, in effect, there is no confusion - because it doesn't matter which ARP reply is listened to, they all say the same thing.  Therefore at the IP level everyone is happy.

So far, this is a reasonably satisfying answer, but not quite omplete.  What about addressing at the MAC level, and at the TCP or UDP level?

At the MAC level, multicast MAC addresses are recognized by switches, and is routed to all the switch ports, since everyone has presented that MAC address as "theirs". So, it copies all the packets to all the servers.

What happens at the TCP or UDP level?
This is where things get a little more interesting.  Now, it's more obvious how each machine gets the packets - because every machine gets them.  But, now what?  We clearly don't want every machine to respond to a given TCP packet. That would totally confuse everything, as would giving every packet to all the applications.  To solve this problem, Linux has added a hashing feature which allows the source address, source and destination port number to be used in a hashing function to allow it to decide which machine will respond to any given request.  So, if you have three hash buckets and three servers, the packet header information (source IP and port numbers) can be hashed into three buckets with one bucket assigned to each server.   If the packet hashes to the hash bucket assigned to this server, then it is kept, and passed along to the UDP or TCP layers.  If it doesn't hash to the bucket assigned to this server, then it's just dropped (ignored).

So, this hashing method determines which host serves the requests.  Although the ethernet driver in every machine sees each packet, each packet is only processed by one machine each.  Now you know how it works.

It also turns out to be very easy to configure using Linux-HA, as you can see on our ClusterIP web page[8].  In the process, Linux-HA also handles all the redundancy and failover of cluster IP buckets for you automatically.  Very cool indeed.

If you only configure one bucket per node, then when a node fails, all of its traffic has to get assigned to one machine.  If you start out with 3 nodes in your ClusterIP group, and one node dies, then that means that one node gets all the additional traffic - effectively doubling its workload.  So, a better idea for "n" nodes, to have n*(n-1) cluster IP buckets.  That way when any given machine fails, its workload is split evenly across the remaining nodes.  In Linux-HA terminology, the ClusterIP address is called a clone resource, and what you want is to configure clone_max to n*(n-1)and clone_node_max also to n*(n-1).  Although clone_node_max probably doesn't have to be this large, it would allow a single node to handle all the traffic, if a sufficient number of ClusterIP peers die.

Performance characteristics
Every node in the cluster will see all incoming IP packets.  As I understand it, many/all switches will also send every packet to every switch port in the subnet (or vlan).  This argues for a small subnet for this function.  But, the packets are discarded at a very early stage - minimizing the overhead on the host.  Outbound packets are not affected by this arrangement.  This kind of arrangement works well for these kinds of cases:

  • long processing time per packet (complex J2EE applications, for example)

  • small incoming packets with large outgoing packets

  • smaller number of processing nodes

It probably works less well with the opposite kinds of configurations:

  • high number of incoming packets with trivial processing per pacekt

  • large incoming packets (uploading DVD images, for example)

  • large number of processing nodes

Note that in this case, since there is no head-end processor like an LVS director that can be a single point of failure, so no special provisions are needed for high-availability when used with Linux-HA.  It is typically not as scalable as LVS load balancer, but it is trivial to set up and use.

[1] http://www.linuxvirtualserver.org/
[2] http://flaviostechnotalk.com/wordpress/index.php/2005/06/12/loadbalancer-less-clusters-on-linux/
[3] http://en.wikipedia.org/wiki/Baseball
[4] http://www.vergenet.net/linux/ldirectord/
[5] http://en.wikipedia.org/wiki/Network_address_translation
[6] http://linux-ha.org/
[7] http://www.nostarch.com/frameset.php?startat=cluster
[8] http://www.linux-ha.org/ClusterIP

October 05, 2007

How to use a watchdog timer

In an earlier posting[1], I mentioned that explaining how to optimally use a watchdog driver would be a good thing to talk about later. Now seems a good time to talk about that, giving a brief overview of some good techniques for getting the most out out of your watchdog timer.

As was mentioned previously, one can have a software watchdog timer like softdog[2], or a hardware timer, or a watchdog utility like apphbd[3] (application heartbeat daemon). Although each method has its advantages and disadvantages, the methods that an application can use to take best advantage of them are very similar.

The basic idea of using a watchdog is simple:  Periodically send a heartbeat to the watchdog timer.  If the application fails to heartbeat in the specified interval, or exits prematurely, then a recovery action is taken.  So, all your application has to do  is set a timer and tickle the watchdog timer when your timer goes off.  Sounds extremely simple - and for the most part it is.

How to get into trouble with watchdog timers

If your application does disk I/O or grows in size as it runs, or calls functions or systems calls that might block, then the timing of your application can change dramatically when the system is under heavy I/O load or memory pressure.  This can mean that your application is judged to be hung when it's not.  When this happens, the watchdog timer you're using will trigger a recovery action - maybe restarting your application, or rebooting the machine.  For this kind of a situation, this is probably not what you had in mind.  Of course, if you're in an HA environment like Linux-HA[4] where a machine reboot will cause a service failover, this may exactly what's needed to straighten out your problem.  As always, YMMV[5].

When not to use watchdog timers

Watchdog timers need reasonably reliable real-time performance, and an application which you can modify and which runs periodically (or which you're willing to make run periodically).  If you can't modify the application, or it is expected to have extremely erratic real-time performance, it doesn't run continuously, or runs in an environment which services your application erratically, then watchdog timers may not be for you.

Making your watchdog timer do more

Having your application not appear to be dead is sort-of-OK, but not exactly a deep metric for how well your application is behaving. Since this code will run inside your application, and you have to modify your application anyway, you have the opportunity to make this a form of white-box testing [6]. To do this, tie tickling your watchdog timer to good measures of your program's sanity. Below are a few examples of how you might go about doing this.  Not every example is appropriate for every type of application, so take this as food for thought.

  • Audit your data structures. If you have several data structures representing work to do, clients, outputs, etc., you can audit your data structures for mutual consistency, and only send out heartbeats when you don't find any errors.  For example, perhaps every piece of work to be done ought to belong to an active client. This is an interesting technique - because it can allow for transient "false positives" or errors that get corrected by the natural flow of the application.  If you audit your data structures once every 10 seconds, but set your heartbeat rate to 30 seconds, then you can have a transient error last up to two iterations without causing a restart.  If it persists beyond that, your watchdog timer will take action.

  • Check for work being processed.  If you have work in your input queues, but no work has been completed since the last heartbeat, then suppress sending out a heartbeat until some work actually gets processed.

  • Check for old work to be done.  If you have work queues, you can skip heartbeats whenever you have work in your queues which is "too old".  This may be a symptom that your application isn't processing its work, or that it has somehow lost track of this particular piece of work.

Of course, these are just a few simple ideas, but they may spark some better ideas for your particular application.  Anything you can examine periodically to see if your application seems to be doing whatever it is it's supposed to do is a potential candidate for using to control when and whether to tickle your watchdog timer.

Making your watchdog timer more reliable

Something you should really avoid is false positives on watchdog timeouts - especially when the consequence is to reboot the machine. Spurious reboots are typically frowned upon ;-).  This is all a fine idea, but how exactly do you go about doing that?  Here are a few tips to keep in mind.

  1. Tune your timeout interval. Some watchdog timers (like apphbd) have a warning level as well as a fatal timing level. Be sure and take advantage of the warning level to help you tune your application's heartbeat interval.  For example, if your application really ought to send out a heartbeat every second, you can set the warning time threshold to 1.5 seconds, and the fatal watchdog timer to a much larger value, say 5 or 10 seconds. That way, as you test, you can see how close you come to the 3 second time limit in your most extreme cases of load.

  2. Know your application's expected behavior.  If is single-threaded and does short-lived tasks throughout the day, except for once a day when it produces a summary report, then keep that extra work and the delay it can cause in mind when setting your timer value.

  3. Make your application a better real-time citizen. Rather than processing all its work in one huge uninterruptible chunk, make sure you process chunks whose size is bounded by some constant amount of time, and allow interruptions (for processing watchdog timer requests) between these chunks.

  4. Consider using the glib mainloop task dispatcher[7]. The mainloop construct is great for event driven programs which don't actually require threading. We use it extensively in Heartbeat [4], and it has worked out very well for us.

  5. Consider using threads. Many people swear by threads as the only way to write any reasonable program. Many people (including many of those same people) swear at them[8], When properly used, threads can be helpful in writing programs with better real-time behavior.

  6. Consider doing your disk I/O into a separate process (or thread). It's usually disk I/O or memory allocation (implying disk I/O) which is most likely to hang your program and give it unpredictable realtime behavior.

  7. Consider using asynchronous I/O. This is another technique for avoiding blocking by disk I/O. It's not terribly portable, and the API seems to me to be a little subject to change, but it's a really nice idea.

  8. Consider locking your program into memory. If your program needs about the same amount of real memory as it occupies virtual memory, then the system impact of locking your program into memory may not be high. This is not for everyone - because if everyone did this, then why bother implementing virtual memory?

  9. Consider setting your program with a soft-realtime (POSIX realtime) priority. Even more than the previous step, this step is not for everyone. If your program goes into an infinite loop, the entire system stops - end of story. Not good. But, if your program is critical, small, and well-behaved, this can be a reasonable thing to do.

  10. Consider running real time Linux [9]. This is an even more drastic step, but if your whole system needs to be real time, some great work has been done here which you might look into.

  11. If you're writing in Java, consider real-time Java from IBM[10]. A better recommendation might be to not use Java, but for some people Java is their religion, but if you or your management insists on Java, then this may be just the ticket for you.

In summary, here are general ideas to keep in mind:

  • Tie tickling the watchdog timer with the sanity of your application

  • Do what is reasonable to improve the predictability of the realtime behavior of your program

Hopefully the tips above will help you do this.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2] http://linux-ha.org/softdog
[3] http://linux.die.net/man/8/apphbd
[4] http://linux-ha.org/
[5] http://en.wiktionary.org/wiki/your_mileage_may_vary
[6] http://en.wikipedia.org/wiki/White_box_testing
[7] http://library.gnome.org/devel/glib/unstable/glib-The-Main-Event-Loop.html
[8] http://sourcefrog.net/weblog/software/languages/java/java-threads.html
[9] http://www-03.ibm.com/press/us/en/pressrelease/21232.wss
[10] http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html

September 28, 2007

Virtualization as High Availability (Disaster Recovery) or High-Availability as Virtualization?

Many pundits and other folks like VMware's CEO Diane Greene have touted[1] virtualization as being the "cure" to disaster recovery, many for the past several years.  Disaster recovery can be pretty reasonably viewed as being high-availability over distance, so it makes some sense to see how DR, HA and virtualization fit together.  What's hype here, and what's real?  Let's look and see what we find out.

What is virtualization?

Wikipedia[2]defines virtualization like this:

In computing, virtualization is a broad term that refers to the abstraction of computer resources. One useful definition is "a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications or end users interact with those resources".

At the moment, this term is most commonly used to refer to machine or OS-level virtualization,  storage virtualization and something I'll call container virtualization.  For now, let's ignore storage virtualization.   Machine-level virtualization means basically making it difficult to tell exactly which physical machine an OS is running on at any given point in time.  There are many other kinds of virtualization as well - for example service virtualization.   Service virtualization doesn't hide or abstract physical machines, but instead virtualizes or hides services.  That is, there is no fixed binding between services and physical machines.  This is, in fact, exactly what classic HA software does - software like Linux-HA[3] . What an HA system does is use service virtualization to recover from server failures, and monitors services so that it can restart them - either locally or on another server.

Customers care about services, not about OSes or physical machines, so from their point of view, machine and service virtualization are similar.  Both  provide the potential for recovering from failures of physical servers and OSes. However, since machine virtualization doesn't  have any visibility of the individual services and the dependencies between them, it can't monitor them and restart them if they fail.

But, the ideas of virtualization need management entities to oversee them and decide to move virtual entities, and coordinate the movement of these virtual entities from one place to another.    VMware will soon have their Site Recovery Manager[4], and Linux has Linux-HA[3] (among other solutions).  The purpose of such management entities is to detect and respond to failures and administrative requests by starting, stopping, and migrating virtual entities in their purview to different sets of physical servers.

What does server (or OS-level) virtualization brings to the table here?

Why would I care?  If HA systems produce a result which sounds so far like it would be similar or even superior from the customer's perspective to what OS level virtualization does (at least from an availability perspective), then why should I care?  The answer is not found in responses to failures, but in responses to administrative events.  If you want to migrate a running virtual machine from physical machine A to physical machine B, many server virtualization techniques (including Xen[10] and VMware) provide a new tool - transparent migration.  In transparent migration, the virtual machine image is migrated from physical machine A to physical machine B while processing work continually without appearing to stop.  This is a pretty cool trick, since you can then migrate virtual machines with apparently zero downtime.  Sadly, you can't migrate a crashed OS or an OS from a crashed server in this way.  As a result, it doesn't help at all with unplanned outages.  However, it makes it possible to take a planned outage with very minimal impact.  In a practical sense, if your server is running full-throttle and writing on all the pages in memory very rapidly, the migration will likely impact performance. So, it's best to not to migrate virtual machines during peak load without good reasons.


Can conventional HA software manage transparent migration?

The simple answer is "mostly no".  Most HA software has only three basic operations that they perform on their resources:  Start, Stop and Monitor.  If you map these operations to operate on virtual machines as resources, start means boot the OS, stop means shut it down, and monitor means look and see if it appears to be running correctly.  Note that each of these also operates on only one machine at a time. A migration like described above becomes stop the virtual machine on machine A, and then start it on machine B.  This is hardly transparent, since the stop and start operations create an outage which will range in time from two minutes to 30 or more minutes and completely loses all transient application state.

The problem is in the HA software's model of resources (or services).  The operations which it performs are like this OP(resource, node) for any given OP like start, stop or monitor.  What's needed instead is a operation model which permits OP(resource, fromnode, tonode).  In addition, when one adds dependencies, then the migratable resource can't depend on any non-migratable resource. [It's actually a little more complicated than this, but it would take too much time to explain all the details in this post).

Are there any HA packages that can model transparent migration?

Yes.  Sorry to sound a bit like a broken record[13], but Linux-HA[3] can do that very nicely - and to my knowledge it is the only HA system which can.  It includes support for migratable resources like virtual machines and container virtualization, and handles the complexities of dependencies alluded to above, along with other potential problems not alluded to above.  So, if you use Linux-HA to manage your virtual machines, you get full use of transparent migration without having to figure out some monumental kludge to shoehorn something kind of like transparent migration into your HA package.  It just works.  And, it just works for containers, and resources which support checkpoint restart.  All in all, a nice feature I'd say.

What does transparent migration have to do with disaster recovery?

Nothing.  Well... Almost nothing.  Let me explain...  Since transparent migration only helps move services when the machines are up, that's not normally very useful in a disaster - since presumably a fire or flood or power failure, or building collapse or whatever has rather rudely and typically unexpectedly interrupted service - precluding migration. Why did I say almost nothing above?  Because some kinds of disasters (hurricanes, and floods for example) often give warning in advance - so you can use transparent migration for those cases.  The other reason is that you can test your disaster recovery without any service outages.  Or rather, you can partially test your disaster recovery.  If there are any resources (disk drives for example) which are only needed during boot, then if those are missing, you probably won't notice when you migrate over and back. So, it is still desirable to periodically do non-transparent migrations of your services to make sure things are all still working after the latest configuration changes.  However, you can migrate over and back every weekend, and maybe once a month or once a quarter perform non-transparent migrations just to make sure all your bases are covered.

Can't you do this with an HA cluster split across sites?

Yes.  You can do it with only service virtualization and a conventional HA system, or you can do it with Linux-HA[3] and either service or server virtualization.  In either case, you have to prepare for the replication of data[5] across sites.  Note that making sure all the data can be accessed from multiple physical machines is part of the baseline discipline which is imposed both by virtual machine failover and by service failover.   This discipline is a starting place for any kind of disaster recovery.  This is a valuable discipline to undertake.  So, although this discipline of replicating data and OSes is a good thing, since it's not unique to virtual machines, this part is more hype than reality.  There's nothing special about virtual machines in this respect.  In this respect, any differences that exist, don't exist because of virtualization per se, but how easy the management tools are to use correctly, and how comprehensive they are.

Advantages of service virtualization

  • Monitors individual services [6] [7] - can detect and recover from software errors that don't cause OS crashes

  • Can recover individual services (more fine-grained)

  • Can recover servers and services without time for a reboot  (faster)

Advantages of Server virtualization

  • Transparent migration for administrative events is outage-free

  • All virtual machines more or less look alike (simplicity)

Best of Both Worlds?

Linux-HA[3] can model virtual machines as resources, and it can model normal services (web server, etc) as resources.  Can it do both at the same time?  Well... Yes and No...  Doing this would require that you have both virtual machines as resources and as cluster members at the same time.  Although you could do this, it might have unpleasant side effects.  However, you could have a cluster of virtual machines and a cluster of real machines - and that would work - but would be two clusters to manage.  To truly get the best of both worlds, you'd have to implement our container resource proposal[8] or something like it.  Of course, since Linux-HA is an open source project, we'd love to see a patch to implement it ;-).

See Also

Virtualization Blog[9],
Dabcc virtualization and server management web site[12],

References

[1]  http://www.byteandswitch.com/document.asp?doc_id=133643
[2] http://en.wikipedia.org/wiki/Virtualization
[3] http://linux-ha.org/
[4] http://www.vmware.com/company/news/releases/srm.html
[5] http://techthoughts.typepad.com/managing_computers/2007/09/automated-disas.html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[7] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html
[8] http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1417
[9] http://www.virtualization.info/
[10] http://www.xensource.com/
[11] http://www.vmware.com/
[12] http://www.dabcc.com/
[13] http://en.wikipedia.org/wiki/Gramophone_record

September 21, 2007

Tools for monitoring services

For the purposes of this posting, I'm concentrating more on things that can tell you if a particular service is working or not working and recover a failed service, and less on a datacenter-wide view of service health.

As I discussed in my previous posting on this topic[1], there are three basic ways to monitor a service:

  1. See if the process providing the service is alive
  2. Use the service API to determine if the service is well (simplified version: check to see if someone is listening on the port)
  3. Instrument the service to periodically heartbeat (check in with) a watchdog program or device

There are a lot of programs to monitor services using these three methods.  Many of the open source programs to provide these kinds of services are listed on the Linux-HA service monitoring page[2].  Rather than reproduce that page here, I'll discuss a few of the better-known programs found there. 

  • Monit[3] is very highly thought of by many sysadmins, and is often used for service monitoring and restart.  The information in this post about monit[3] was provided by the most excellent Christian Wilken.

    Monit is a monitoring tool which can take the necessary action to ensure service availability. It can monitor services locally or remotely by polling a specific port (type 2 monitoring). It can monitor binary files. I.e. you might want monit to monitor your apache binary file or any other binary on the system. It checks the md5sum and the octal permissions of the file and warns the administrator if something has changed (and unmonitor the service). It can also check the uid and gid of the file.

    Through the integrated web interface you can see uptime, cpu load and memory consumption for each service monitored. you can decide what to do when a service fails.

    Here's an example: you want to monitor apache (port 80) and if it fails you want to restart the service. If it fails to restart apache (after let's say 3 tries) you can set it up to raise an alarm (and send a mail to the admins). you can also setup the apache service monitoring to depend on apache_bin (the binary apache file). That means if the apache binary has changed the apache service is unmonitored until you take action.

    Monit also monitors its own binaries so if you (or someone else for that matter) updates the monit code you will also get an alarm and you have to take action. However, it does not monitor its own operation, or register itself with some other monitoring service like /dev/watchdog or apphbd.   Monit is a very flexible monitoring/automation tool and its config file is simple and easy to learn.  Monit is a simple yet effective tool.
  • Mon[4] is another well-thought-of monitoring tool.  It is a straightforward Perl program which has goals very similar to Monit.  Mon monitors services, and then takes actions when services are found to have failed.  For example, it will notify an administrator, or restart a service.  It comes with scripts which are capable of monitoring a large number of services.  It has been around for many years, and has many happy users.  It is simple to configure, normally monitors remotely, supports dependencies between monitors, typically is used as a type 2 monitoring service.  Like most similar services, nothing normally monitors mon's operation to see if it is still running, or operating correctly.
  • Nagios[7] is a very flexible system management package which provides for a lot of capabilities for monitoring servers and services  and taking a variety of actions when failures occur.  It is designed to watch and manage your entire data center.  For the most part, it uses SNMP to do that - however it can use other methods for monitoring as well.  One of the more interesting features of Nagios is that it supports service dependencies.  Normally, each service is defined by hand, and it can take a wide variety of actions when things fail, including escalation actions.  However, it does not natively provide process restarting.   Recently Gerard Petersen has documented[8] a way of getting nagios to do this with a consistent set of rules.  Understanding his method of doing this generically requires an extensive knowledge of Nagios and Gerard describes it as "somewhat cumbersome".  Nevertheless, it does add this capability to Nagios in a generic way.  It is a "type 1" monitoring system (done remotely with ssh) - it only checks if the process is alive.  You can tie this capability together with real service functionality checks (type 2) using dependencies.  However, if the service can't be stopped gracefully, and it has to do a kill -9 to stop it, then the service may not work properly afterwards.  This is a pretty unsurprising caveat.  However, it will attempt to kill the process and recover without manual intervention.  Nagios works remotely and performs both type 1 and type 2 checks (remotely), and implements service dependencies.

    It is worth noting that this Nagios technique will not work for servers which are part of an HA cluster, because there is no fixed association between a server name and which services are running on it.  You could use it to monitor the HA server processes if you like, but usually the HA software will do that itself.  Note that when you combine management services, you need to make sure you understand which management service is going to control what, know how each affects the other and  keep them strictly out of each other's way.

    Nagios does much more than just monitor services and restart daemons, however.  It keeps statistics, and provides historical and near-real-time performance and load graphs, provides an extensive set of alerts, and performs a variety of other system management functions. I don't know if/how the central Nagios processes are monitored.

  • apphbd[9] is part of the Linux-HA[10] suite of programs.  It performs type 3 monitoring of properly instrumented services.  That is, you can write a program which connects to apphbd, and sends it heartbeats periodically.  When your application doesn't send a heartbeat message by the expected time, or exits without informing apphbd of its intention to exit, then apphbd will notify another application of this condition, which will restart your application.  Apphbd normally registers with /dev/watchdog to send heartbeats to the kernel watchdog device.  As a precaution against false reboots, apphbd locks itself into memory and runs with soft realtime priority.

  • Heartbeat[10] from the Linux-HA project is typically thought of as a clustering package.  However, it is perfectly happy to manage a single system - which it thinks of as a cluster with one node.  It is capable of doing both type 1 and type 2 checks, implements service dependencies, and will automatically restart services if they die, including services which depend on failed service.   The configuration is in XML, and isn't difficult but can be tedious.

    For the case of monitoring a single machine, I wrote a prototype script[11] which looks at your init.d directories for all set of active services, and creates a cib.xml configuration file for the set of services on your machine.  Heartbeat will then poll to monitor those services and ensure that they are running, and restart them (and any services which depend on them) automatically.  Because the configuration this script creates relies solely on the init scripts, the monitoring is done by using the status actions of the LSB.  As a result, for the most part these are type 1 monitoring.  To do type 2 service monitoring, it is necessary to create an R2-style Heartbeat configuration and make sure the services to be monitored have a repeating monitor action declared for them.  Of course, the script mentioned before makes sure all this happens.

    If you wish to do more thorough monitoring, you can convert these scripts to OCF resource agents, modify the configuration and perform any kind of monitoring you wish.  Heartbeat itself can be monitored by SNMP or CIM - which lets you easily include Heartbeat as being monitored over the data center via Nagios, OpenNMS or other package.  In addition, if you wish to protect yourself against OS hangs and crashes and hardware problems, you can easily add servers to your one-node cluster, make some small rule changes, and now your service will continue running even if a server or OS dies.  The process of changing from a 1-node to an n-node cluster isn't at all complicated.  If one has things configured properly, and one uses the autojoin[12] feature.

    It is worth noting that the action which Heartbeat takes when a service can't be stopped is quite different from Nagios'.  In Heartbeat, when a resource (service) can't be stopped, Heartbeat would normally reboot the machine.  Although this sounds drastic, it can get most services running even when the problem is in the kernel.  Since normally Heartbeat is managing a cluster, the service would normally be taken over immediately by another node in the cluster.

    With regard to monitoring Heartbeat itself: it monitors its own operation, and it can be configured to heartbeat either with a watchdog device like softdog, or with apphbd (discussed above).  When you put it into a cluster, then the nodes of the cluster monitor each other, and use STONITH to reboot machines which have stopped working.  If you use Heartbeat then, you get as complete a set of monitoring as is available.
  • cl_respawn is a simple tool which is packaged as part of the Heartbeat[10] package.  cl_respawn takes arguments which tell it how often to heartbeat with apphbd, and the value of a "magic" exit code.  This exit code is one that when the application exits with this return code that cl_respawn does not restart the its child process.  Although it only does type 1 monitoring, it doesn't poll to see if a process has died, ithe OS gives it a signal and it responds immediately.  When you invoke a process with cl_respawn, it is necessary to make sure it does not fork off and run in the background.
  • Supervise[13] from D. J. Bernstein's daemontools[14] replaces the usual init.d with a service directory, so that things are installed, stopped and started and otherwise managed in a non-standard way.    However, as supervise is part of this whole process, service monitoring and restarting comes along for free.  When a daemon is started it becomes a child of supervise.  When a daemon dies, supervise is notified and it is automatically respawned immediately without waiting for a poll interval.  Supervise itself does not appear to be monitored for failures by svscan, which appears to be unmonitored.  Daemontools does not appear to support service dependencies.  Like cl_respawn, when you invoke a process with daemontools it is necessary to make sure it does not fork off and run in the background.
  • Watchdog drivers can either be implemented in software like softdog[15]  or they can be implemented in hardware.  All the watchdog drivers that ship with Linux share the same API - that implemented by softdog[15].  As a result, a program which works with softdog can work quite nicely with the hardware-based watchdog drivers.  It's worth noting that kernel hangs which completely disable kernel timer services can cause softdog to malfunction. The good news that hangs like this are extremely rare.   If one has a hardware based watchdog device, it takes a hardware failure for these to fail to reset a hung machine.  This too, is extremely rare.  If one is a true belt-and-suspenders sysadmin (with a enough time and budget), one could use both a software and hardware based watchdog
  • ldirectord[16] is a specialized monitoring system which is delivered as a subpackage of the Linux-HA suite.  It monitors servers and services in a load balancing cluster and takes dead servers out of the load balancing rotation, and restarts services when they stop responding.  It is a type 2 service, and because it's an HA resource, is normally monitored by Heartbeat itself.

See Also
Howtoforge articles on monitoring[17].

Related systems - not quite on topic

Ganglia[5] is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state.

Although it doesn't attempt to repair problems, it is widely used in high-performance clusters and might be of interest to people reading this post.

Opennms[6] is a Java-based enterprise grade open source network monitoring platform. It consists of a community supported open-source project as well as a commercial services, training and support organization.

The goal is for OpenNMS to be a truly distributed, scalable platform for all aspects of the FCAPS network management model, and to make this platform available to both open source and commercial applications.

OpenNMS is oriented towards responding to SNMP data and traps, but can be modified to monitor other things.  It does not appear to me to be well-suited to recovering from a service or process hanging, nor for monitoring a single system.  Where it shines is in managing an entire data center with a large number of SNMP-enabled servers and network devices.

References

[1]  http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2]  http://linux-ha.org/RelatedTechnologies/MonitoringSoftware
[3]  http://www.tildeslash.com/monit/
[4]  http://mon.wiki.kernel.org/index.php/Main_Page
[5]  http://ganglia.info/
[6]  http://www.opennms.org/
[7]  http://www.nagios.org/
[8]  http://www.gp-net.nl/2007/05/12/nagios-remote-eventhandling-for-init-scripts/en/
[9]  http://linux.die.net/man/8/apphbd
[10] http://linux-ha.org/
[11] http://hg.linux-ha.org/dev/raw-file/8c5da9553636/tools/1node2heartbeat
[12] http://linux-ha.org/ha.cf/AutojoinDirective
[13] http://cr.yp.to/daemontools/supervise.html
[14] http://cr.yp.to/daemontools.html
[15] http://en.wikibooks.org/wiki/Linux_Kernel_Drivers_Annotated/Character_Drivers/Softdog_Driver
[16] http://linux.die.net/man/8/ldirectord
[17] http://www.howtoforge.com/taxonomy_menu/1/59

September 15, 2007

Service Monitoring - basics of a key part of automated management

Software failures are pretty common.  If you've used software, you've seen software fail.  Typically they're among the most common types of failures, with only power failures being more common.  So, if you want your service to work reliably, something is going to have to monitor the software that implements it to see if it's working.  Software outages are estimated at around 25% of the total of unplanned outages[1].  Like most general estimates  measured against other people's computers - YMMV - Your Mileage May Vary[2] .

A sufficient condition for program triviality is that it have no bugs

So, it seems pretty obvious that this is a pretty big chunk of potential outages to watch over and try and eliminate or recover from if you can.  Of course, before you can recover from a problem you have to know there's a problem.  That's where monitoring comes in.  Monitoring software is software that monitors (or watches) other software.

Local or Network Monitoring

There are lots of packages for monitoring software available in the wild.  I tend to favor monitoring software that runs on the machine being monitored - it works even when the network is down, it minimizes the network traffic, and security concerns are greatly reduced.  But, it can be relatively painful to administer since you have to install and configure software on each machine, and you have to separately monitor the servers.  If you have thousands of servers you're monitoring in this way, it's potentially very painful (at least without the right tools for managing it centrally).  Since I like clusters, I'll mention that if you're running a cluster, this monitoring can be administered centrally on the cluster, making that much simpler than it would otherwise be - in fact it may come "for free" with when you configure the services in the cluster.

How to Monitor Services?

You have software on your server which implements services.  So, how are you going to monitor it?   There are basically two different ways to monitor software externally - the easy way, and the not-so-easy way.

  1. A really simple way to monitor a service is to just look to see if it's still running.  If you're on a UNIX-like machine, you can do a ps to see if it's still running, and if it is, say "Good service!", and if it's not, say "Bad service - Go to your room!".  If the majority of your software failures result in the process exiting (leaving a core or not), then this will work just fine for you.  This is how most init script status actions work.  However, if the software hangs, or gives crazy or otherwise incorrect results, this method won't detect that kind of misbehavior for you - you'll need a less-simple method.  It's also worth noting that, simple as this is, you have to be on the machine that's running the service.  You can't directly see the process table of a remote machine.  This technique is the basis for how the UNIX respawn init directive works - when a service exits it gets automatically restarted.  Simple, but usually too simple - which is why it only sees limited use in practice.  You could argue that good HA systems spend half of their time implementing the respawn directive right ;-).
  2. A not-so-simple-way to monitor a service is to use the service and see if the result you got was reasonable, and completed in a reasonable time.  For example, if you are monitoring a web server, you might do an HTTP GET operation on the port the server is on, and see if you get reasonable-looking HTML and a non-error return code in a "reasonable" time.  For example, the Linux-HA[3] apache[4] resource agent does exactly this using the wget[5] command.  If you have a database, you might do a short query whose answer is easily sanity-checked.  This is what the Linux-HA[3] DB2[6] resource agent does (as do several other database resource agents).  Another advantage of this technique is that you can often perform it remotely - which works out well if you want to monitor everything centrally.

Is there some kind of standard for monitoring UNIX-like services?

In fact, there are a couple of them and Linux-HA[3] implements them both.  These two standards are the LSB (Linux Standard Base[7]) and the OCF (Open Cluster Framework[8]) standards.

  • The Linux Standard Base defines a standard for init scripts[9] which tells how to start, stop, and determine the status of a service.  It's the status part that we care about here - because this implements a poor-man's simple monitoring operation.
  • The Open Cluster Framework[8] defines a standard for resource agents[10] (RAs) which tells how to start, stop, and (lucky for us) monitor a service.  In fact, the monitor action is interesting, because it defines multiple levels of monitoring.  Maybe you want to do something lightweight frequently, and something heavier-weight less often.  If so, then the OCF RA standard may be for you.  For the interested, it's simple to convert an init script into an OCF resource agent - since the OCF RA specification was built on top of the LSB init specification.

Both of these standards only work for monitoring services locally.  That is, you have to be on the server to use either.  So, you can't use them if you want to monitor services remotely.

Quis custodiet ipsos custodes?  - Who will watch the watchers?

If your monitoring software fails, how would you know?  Since it's software, and like all software, might fail, who's monitoring it to make sure it works?  As you can see, this is a potentially endless problem.  In my experience, no one has any endless solutions - only endless problems.  But, there is a reasonable way to approach it - create a hierarchy of watchers.

Applications (which can also watch themselves) are watched by a watcher program, which in turn registers with a watchdog driver.  In Linux, that can be either the softdog[11] watchdog driver or a hardware watchdog driver.  To use a watchdog driver, the watching program has to check in periodically and tickle the watchdog driver periodically.  If it doesn't, then the system reboots.  Good reason to make the top level watcher as simple as possible.  How to tickle a watchdog optimally is good fodder for a future post.

Unfortunately, the standard Linux softdog driver only allows one program at a time to use it.  Unless you want to restrict yourself to only one watcher on a system, you have to have some kind of watchdog program that the watchers register with.  The Linux-HA project provides the apphbd[12] (application heartbeat daemon) designed expressly for this purpose.  So, you register with apphbd, and apphbd registers with the watchdog driver - which watches it.  Apphbd can watch as many programs as you want - but they have to register with it, and they have to tickle it (send it heartbeats) periodically.  That means your application has to be modified to do it - which is different from the two earlier approaches.  Just like the watchdog driver, you have to tickle apphbd periodically, or it notifies someone who will take a recovery action.  Fortunately, however, apphbd doesn't reboot the machine - normally the application just gets restarted if it dies or stops sending heartbeats.  Avoiding unnecessary reboots is generally thought to be a good thing  ;-) - so most people like this a lot.

In the process of talking about how to watch watchers, we've added a third method of monitoring applications.  The first two could be done with any application, and the third only by applications you've modified.  To summarize, they are:

  1. Checking to see if the process is running (local only)
  2. Exercising an application API to check for application sanity (local or remote)
  3. "Tickling" a watchdog program or driver (typically local) - requires modifying the application.

But what if the kernel fails, or the driver fails, or the watchdog hardware fails, what do you do then?   Really there's only one thing to do - that's let another machine watch this one - and give it the ability to kill your machine using something like STONITH[13].  Voila!  You've just made an HA cluster.  How does that monitoring machine get monitored?  You let the cluster members monitor each other - resulting in a sort of mutual monitoring society.  Can this fail?  Of course! - given enough problems anything can fail - but it's really unlikely.  Trying to make this kind of problem go away completely helps you discover that paranoia can be a very expensive hobby.

This idea of hierarchical watchers is generally a good pattern to follow for monitoring.  For example, have each machine monitor its own services, and monitor the machines in a cluster, then monitor the clusters centrally.  This minimizes network traffic, and is scalable to very large environments.

I've covered the basic concepts of service monitoring, without quite getting around to talking about how to implement it in practice.  This will be covered in a future posting.

References
[1] Gregory Pfister - In Search Of Clusters - Second Edition. p. 390, 1998, Prentice Hall
[2] http://en.wiktionary.org/wiki/your_mileage_may_vary
[3] http://linux-ha.org/
[4] http://httpd.apache.org/
[5] http://www.gnu.org/software/wget/
[6] http://www.ibm.com/db2
[7] http://www.linuxbase.org/
[8] http://opencf.org/
[9] http://www.linuxbase.org/spec/refspecs/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
[10] http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD
[11] http://en.wikibooks.org/wiki/Linux_Kernel_Drivers_Annotated/Character_Drivers/Softdog_Driver
[12] http://pwet.fr/man/linux/administration_systeme/apphbd
[13] http://linux-ha.org/STONITH or http://en.wikipedia.org/wiki/STONITH

September 07, 2007

Automated Disaster Recovery - Data Replication

This post covers the basics of data replication in automated disaster recovery.  Automated disaster recovery means that when a site providing a service goes down, the service will continue running elsewhere without human intervention.  If you're willing to trust your automation, this is an ideal situation - one site goes down (for whatever reason) and the other one takes over - automatically - and pretty quickly.  In most cases, your users get continued access to the service.  The Linux-HA software is well-prepared to do this for you correctly - even when your cluster is split across sites.  It has specific technology in it to deal with this situation - which I'll explain in a later post.

In Disaster recovery terminology, this kind of a solution provides meets both excellent recovery point objectives (RPO), and recovery time objectives (RTO).

There are a few problems with this though -- the second site has to have enough servers to be able to run the service, and of course, the software to run the service, and - the most difficult part - it has to have an up-to-date copy of the data the service uses.  This post will concentrate on this aspect of automated disaster recovery.

There are two basic approaches to replicating data across sites - application-specific, or disk volume level replication.

Application-specific Replication
A number of applications (DB2 UDB, Oracle, DHCP, DNS, etc.) have built-in (or add-on) replication mechanisms which will do a great job of keeping two copies of your data in sync - and these methods are typically the best when they're available.  However, if your application doesn't have such a method, then you'll have to use disk-volume-level replication.

Disk-volume-level replication
If you're running Linux, a good general way of replicating dynamic data copied across sites is DRBD.  DRBD is a great open source tool which keeps two copies of data in sync, and worries a lot about data integrity -- which copy of the data is up to date, and related things.  It works most efficiently when you make one side master and update from only the master side.  Typically, in a split-site design, this is typically what's needed - both for efficiency and keeping both your sanity and your data's sanity.

DRBD is only one method of doing this.  Most storage vendors (IBM for example) sell products that can replicate storage volumes across distances.  There are also a number of other commercial replication packages, and a few other open source disk replication packages.

If you are going to do automated site failover, then it's highly recommended that you replicate your data synchronously rather than asynchronously.

Synchronous replication means that before a disk write (or database transaction) is reported as being complete, that the replication software makes sure the other side has a good copy of the data on disk.  That way, no writes (or transactions) are ever reported as complete, but somehow lost after a failover.

With asynchronous replication,  the replication software ensures only that the write (or transaction) is queued to send to the other machine before the write or transaction is reported as being complete.  This means that in case of a crash, the last few writes or transactions may be lost - which may compromise your recovery point objective (RPO)..  If you fail over to the backup site, those last few transactions may be lost permanently.  This is obviously a disadvantage.  If you failed over automatically due to a communications link being down for a few minutes, those transactions were lost for what may turn out to be very little reason.  This is the kind of thing that has been known to encourage people to make out their resumes unexpectedly.

So, with asynchronous replication having rather serious limitations, why do people use it?  It's simple really - the speed of light.  If your two sites are 1000 km apart, the round trip time to send a packet, write the data, and confirm it is greater than the speed of light in fiber.   Normally people think of the speed of light as 300,000 km/sec.  However, in fibre, the speed of light is more like 195000 km/sec.  If you want your transactions to complete with no more than 1ms delay due to network delays, then you need your round trip ping time to be less than 1 ms.   This means that in the worst case, the round trip distance can't be more than 195 km (or 97 km each way).  However, with disk delays, and delays induced by store-and-forward algorithms in networking hardware, the practical rule-of-thumb distance is more like 50 km.  But, of course, this depends on how write-intensive your application is, and how much delay you can tolerate.  I'm not an expert in this area, but perhaps someone reading this can offer their opinions on it.  You probably figured out pretty much right away, that you can't get this kind of latency over the public Internet - there are just too many hops.  This means for synchronous replication that you need a dedicated (and obviously low-latency) network between your sites.

To summarize - for the things we've talked about so far, you need:

  • Two (or more) sites
  • Servers on both sites
  • Replication software or hardware
  • Duplicate storage on both sites
  • Dedicated low-latency bandwidth to connect the two sites

Of course, this isn't nearly enough for a complete solution, but it covers most of the basics for replication.

Another resource worth looking at is Christoph Miasch's thesis on Server-Based Wide Area Data Replication for Disaster Recovery