Main | October 2007 »

September 2007

September 28, 2007

Virtualization as High Availability (Disaster Recovery) or High-Availability as Virtualization?

Many pundits and other folks like VMware's CEO Diane Greene have touted[1] virtualization as being the "cure" to disaster recovery, many for the past several years.  Disaster recovery can be pretty reasonably viewed as being high-availability over distance, so it makes some sense to see how DR, HA and virtualization fit together.  What's hype here, and what's real?  Let's look and see what we find out.

What is virtualization?

Wikipedia[2]defines virtualization like this:

In computing, virtualization is a broad term that refers to the abstraction of computer resources. One useful definition is "a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications or end users interact with those resources".

At the moment, this term is most commonly used to refer to machine or OS-level virtualization,  storage virtualization and something I'll call container virtualization.  For now, let's ignore storage virtualization.   Machine-level virtualization means basically making it difficult to tell exactly which physical machine an OS is running on at any given point in time.  There are many other kinds of virtualization as well - for example service virtualization.   Service virtualization doesn't hide or abstract physical machines, but instead virtualizes or hides services.  That is, there is no fixed binding between services and physical machines.  This is, in fact, exactly what classic HA software does - software like Linux-HA[3] . What an HA system does is use service virtualization to recover from server failures, and monitors services so that it can restart them - either locally or on another server.

Customers care about services, not about OSes or physical machines, so from their point of view, machine and service virtualization are similar.  Both  provide the potential for recovering from failures of physical servers and OSes. However, since machine virtualization doesn't  have any visibility of the individual services and the dependencies between them, it can't monitor them and restart them if they fail.

But, the ideas of virtualization need management entities to oversee them and decide to move virtual entities, and coordinate the movement of these virtual entities from one place to another.    VMware will soon have their Site Recovery Manager[4], and Linux has Linux-HA[3] (among other solutions).  The purpose of such management entities is to detect and respond to failures and administrative requests by starting, stopping, and migrating virtual entities in their purview to different sets of physical servers.

What does server (or OS-level) virtualization brings to the table here?

Why would I care?  If HA systems produce a result which sounds so far like it would be similar or even superior from the customer's perspective to what OS level virtualization does (at least from an availability perspective), then why should I care?  The answer is not found in responses to failures, but in responses to administrative events.  If you want to migrate a running virtual machine from physical machine A to physical machine B, many server virtualization techniques (including Xen[10] and VMware) provide a new tool - transparent migration.  In transparent migration, the virtual machine image is migrated from physical machine A to physical machine B while processing work continually without appearing to stop.  This is a pretty cool trick, since you can then migrate virtual machines with apparently zero downtime.  Sadly, you can't migrate a crashed OS or an OS from a crashed server in this way.  As a result, it doesn't help at all with unplanned outages.  However, it makes it possible to take a planned outage with very minimal impact.  In a practical sense, if your server is running full-throttle and writing on all the pages in memory very rapidly, the migration will likely impact performance. So, it's best to not to migrate virtual machines during peak load without good reasons.


Can conventional HA software manage transparent migration?

The simple answer is "mostly no".  Most HA software has only three basic operations that they perform on their resources:  Start, Stop and Monitor.  If you map these operations to operate on virtual machines as resources, start means boot the OS, stop means shut it down, and monitor means look and see if it appears to be running correctly.  Note that each of these also operates on only one machine at a time. A migration like described above becomes stop the virtual machine on machine A, and then start it on machine B.  This is hardly transparent, since the stop and start operations create an outage which will range in time from two minutes to 30 or more minutes and completely loses all transient application state.

The problem is in the HA software's model of resources (or services).  The operations which it performs are like this OP(resource, node) for any given OP like start, stop or monitor.  What's needed instead is a operation model which permits OP(resource, fromnode, tonode).  In addition, when one adds dependencies, then the migratable resource can't depend on any non-migratable resource. [It's actually a little more complicated than this, but it would take too much time to explain all the details in this post).

Are there any HA packages that can model transparent migration?

Yes.  Sorry to sound a bit like a broken record[13], but Linux-HA[3] can do that very nicely - and to my knowledge it is the only HA system which can.  It includes support for migratable resources like virtual machines and container virtualization, and handles the complexities of dependencies alluded to above, along with other potential problems not alluded to above.  So, if you use Linux-HA to manage your virtual machines, you get full use of transparent migration without having to figure out some monumental kludge to shoehorn something kind of like transparent migration into your HA package.  It just works.  And, it just works for containers, and resources which support checkpoint restart.  All in all, a nice feature I'd say.

What does transparent migration have to do with disaster recovery?

Nothing.  Well... Almost nothing.  Let me explain...  Since transparent migration only helps move services when the machines are up, that's not normally very useful in a disaster - since presumably a fire or flood or power failure, or building collapse or whatever has rather rudely and typically unexpectedly interrupted service - precluding migration. Why did I say almost nothing above?  Because some kinds of disasters (hurricanes, and floods for example) often give warning in advance - so you can use transparent migration for those cases.  The other reason is that you can test your disaster recovery without any service outages.  Or rather, you can partially test your disaster recovery.  If there are any resources (disk drives for example) which are only needed during boot, then if those are missing, you probably won't notice when you migrate over and back. So, it is still desirable to periodically do non-transparent migrations of your services to make sure things are all still working after the latest configuration changes.  However, you can migrate over and back every weekend, and maybe once a month or once a quarter perform non-transparent migrations just to make sure all your bases are covered.

Can't you do this with an HA cluster split across sites?

Yes.  You can do it with only service virtualization and a conventional HA system, or you can do it with Linux-HA[3] and either service or server virtualization.  In either case, you have to prepare for the replication of data[5] across sites.  Note that making sure all the data can be accessed from multiple physical machines is part of the baseline discipline which is imposed both by virtual machine failover and by service failover.   This discipline is a starting place for any kind of disaster recovery.  This is a valuable discipline to undertake.  So, although this discipline of replicating data and OSes is a good thing, since it's not unique to virtual machines, this part is more hype than reality.  There's nothing special about virtual machines in this respect.  In this respect, any differences that exist, don't exist because of virtualization per se, but how easy the management tools are to use correctly, and how comprehensive they are.

Advantages of service virtualization

  • Monitors individual services [6] [7] - can detect and recover from software errors that don't cause OS crashes

  • Can recover individual services (more fine-grained)

  • Can recover servers and services without time for a reboot  (faster)

Advantages of Server virtualization

  • Transparent migration for administrative events is outage-free

  • All virtual machines more or less look alike (simplicity)

Best of Both Worlds?

Linux-HA[3] can model virtual machines as resources, and it can model normal services (web server, etc) as resources.  Can it do both at the same time?  Well... Yes and No...  Doing this would require that you have both virtual machines as resources and as cluster members at the same time.  Although you could do this, it might have unpleasant side effects.  However, you could have a cluster of virtual machines and a cluster of real machines - and that would work - but would be two clusters to manage.  To truly get the best of both worlds, you'd have to implement our container resource proposal[8] or something like it.  Of course, since Linux-HA is an open source project, we'd love to see a patch to implement it ;-).

See Also

Virtualization Blog[9],
Dabcc virtualization and server management web site[12],

References

[1]  http://www.byteandswitch.com/document.asp?doc_id=133643
[2] http://en.wikipedia.org/wiki/Virtualization
[3] http://linux-ha.org/
[4] http://www.vmware.com/company/news/releases/srm.html
[5] http://techthoughts.typepad.com/managing_computers/2007/09/automated-disas.html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[7] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html
[8] http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1417
[9] http://www.virtualization.info/
[10] http://www.xensource.com/
[11] http://www.vmware.com/
[12] http://www.dabcc.com/
[13] http://en.wikipedia.org/wiki/Gramophone_record

September 21, 2007

Tools for monitoring services

For the purposes of this posting, I'm concentrating more on things that can tell you if a particular service is working or not working and recover a failed service, and less on a datacenter-wide view of service health.

As I discussed in my previous posting on this topic[1], there are three basic ways to monitor a service:

  1. See if the process providing the service is alive
  2. Use the service API to determine if the service is well (simplified version: check to see if someone is listening on the port)
  3. Instrument the service to periodically heartbeat (check in with) a watchdog program or device

There are a lot of programs to monitor services using these three methods.  Many of the open source programs to provide these kinds of services are listed on the Linux-HA service monitoring page[2].  Rather than reproduce that page here, I'll discuss a few of the better-known programs found there. 

  • Monit[3] is very highly thought of by many sysadmins, and is often used for service monitoring and restart.  The information in this post about monit[3] was provided by the most excellent Christian Wilken.

    Monit is a monitoring tool which can take the necessary action to ensure service availability. It can monitor services locally or remotely by polling a specific port (type 2 monitoring). It can monitor binary files. I.e. you might want monit to monitor your apache binary file or any other binary on the system. It checks the md5sum and the octal permissions of the file and warns the administrator if something has changed (and unmonitor the service). It can also check the uid and gid of the file.

    Through the integrated web interface you can see uptime, cpu load and memory consumption for each service monitored. you can decide what to do when a service fails.

    Here's an example: you want to monitor apache (port 80) and if it fails you want to restart the service. If it fails to restart apache (after let's say 3 tries) you can set it up to raise an alarm (and send a mail to the admins). you can also setup the apache service monitoring to depend on apache_bin (the binary apache file). That means if the apache binary has changed the apache service is unmonitored until you take action.

    Monit also monitors its own binaries so if you (or someone else for that matter) updates the monit code you will also get an alarm and you have to take action. However, it does not monitor its own operation, or register itself with some other monitoring service like /dev/watchdog or apphbd.   Monit is a very flexible monitoring/automation tool and its config file is simple and easy to learn.  Monit is a simple yet effective tool.
  • Mon[4] is another well-thought-of monitoring tool.  It is a straightforward Perl program which has goals very similar to Monit.  Mon monitors services, and then takes actions when services are found to have failed.  For example, it will notify an administrator, or restart a service.  It comes with scripts which are capable of monitoring a large number of services.  It has been around for many years, and has many happy users.  It is simple to configure, normally monitors remotely, supports dependencies between monitors, typically is used as a type 2 monitoring service.  Like most similar services, nothing normally monitors mon's operation to see if it is still running, or operating correctly.
  • Nagios[7] is a very flexible system management package which provides for a lot of capabilities for monitoring servers and services  and taking a variety of actions when failures occur.  It is designed to watch and manage your entire data center.  For the most part, it uses SNMP to do that - however it can use other methods for monitoring as well.  One of the more interesting features of Nagios is that it supports service dependencies.  Normally, each service is defined by hand, and it can take a wide variety of actions when things fail, including escalation actions.  However, it does not natively provide process restarting.   Recently Gerard Petersen has documented[8] a way of getting nagios to do this with a consistent set of rules.  Understanding his method of doing this generically requires an extensive knowledge of Nagios and Gerard describes it as "somewhat cumbersome".  Nevertheless, it does add this capability to Nagios in a generic way.  It is a "type 1" monitoring system (done remotely with ssh) - it only checks if the process is alive.  You can tie this capability together with real service functionality checks (type 2) using dependencies.  However, if the service can't be stopped gracefully, and it has to do a kill -9 to stop it, then the service may not work properly afterwards.  This is a pretty unsurprising caveat.  However, it will attempt to kill the process and recover without manual intervention.  Nagios works remotely and performs both type 1 and type 2 checks (remotely), and implements service dependencies.

    It is worth noting that this Nagios technique will not work for servers which are part of an HA cluster, because there is no fixed association between a server name and which services are running on it.  You could use it to monitor the HA server processes if you like, but usually the HA software will do that itself.  Note that when you combine management services, you need to make sure you understand which management service is going to control what, know how each affects the other and  keep them strictly out of each other's way.

    Nagios does much more than just monitor services and restart daemons, however.  It keeps statistics, and provides historical and near-real-time performance and load graphs, provides an extensive set of alerts, and performs a variety of other system management functions. I don't know if/how the central Nagios processes are monitored.

  • apphbd[9] is part of the Linux-HA[10] suite of programs.  It performs type 3 monitoring of properly instrumented services.  That is, you can write a program which connects to apphbd, and sends it heartbeats periodically.  When your application doesn't send a heartbeat message by the expected time, or exits without informing apphbd of its intention to exit, then apphbd will notify another application of this condition, which will restart your application.  Apphbd normally registers with /dev/watchdog to send heartbeats to the kernel watchdog device.  As a precaution against false reboots, apphbd locks itself into memory and runs with soft realtime priority.

  • Heartbeat[10] from the Linux-HA project is typically thought of as a clustering package.  However, it is perfectly happy to manage a single system - which it thinks of as a cluster with one node.  It is capable of doing both type 1 and type 2 checks, implements service dependencies, and will automatically restart services if they die, including services which depend on failed service.   The configuration is in XML, and isn't difficult but can be tedious.

    For the case of monitoring a single machine, I wrote a prototype script[11] which looks at your init.d directories for all set of active services, and creates a cib.xml configuration file for the set of services on your machine.  Heartbeat will then poll to monitor those services and ensure that they are running, and restart them (and any services which depend on them) automatically.  Because the configuration this script creates relies solely on the init scripts, the monitoring is done by using the status actions of the LSB.  As a result, for the most part these are type 1 monitoring.  To do type 2 service monitoring, it is necessary to create an R2-style Heartbeat configuration and make sure the services to be monitored have a repeating monitor action declared for them.  Of course, the script mentioned before makes sure all this happens.

    If you wish to do more thorough monitoring, you can convert these scripts to OCF resource agents, modify the configuration and perform any kind of monitoring you wish.  Heartbeat itself can be monitored by SNMP or CIM - which lets you easily include Heartbeat as being monitored over the data center via Nagios, OpenNMS or other package.  In addition, if you wish to protect yourself against OS hangs and crashes and hardware problems, you can easily add servers to your one-node cluster, make some small rule changes, and now your service will continue running even if a server or OS dies.  The process of changing from a 1-node to an n-node cluster isn't at all complicated.  If one has things configured properly, and one uses the autojoin[12] feature.

    It is worth noting that the action which Heartbeat takes when a service can't be stopped is quite different from Nagios'.  In Heartbeat, when a resource (service) can't be stopped, Heartbeat would normally reboot the machine.  Although this sounds drastic, it can get most services running even when the problem is in the kernel.  Since normally Heartbeat is managing a cluster, the service would normally be taken over immediately by another node in the cluster.

    With regard to monitoring Heartbeat itself: it monitors its own operation, and it can be configured to heartbeat either with a watchdog device like softdog, or with apphbd (discussed above).  When you put it into a cluster, then the nodes of the cluster monitor each other, and use STONITH to reboot machines which have stopped working.  If you use Heartbeat then, you get as complete a set of monitoring as is available.
  • cl_respawn is a simple tool which is packaged as part of the Heartbeat[10] package.  cl_respawn takes arguments which tell it how often to heartbeat with apphbd, and the value of a "magic" exit code.  This exit code is one that when the application exits with this return code that cl_respawn does not restart the its child process.  Although it only does type 1 monitoring, it doesn't poll to see if a process has died, ithe OS gives it a signal and it responds immediately.  When you invoke a process with cl_respawn, it is necessary to make sure it does not fork off and run in the background.
  • Supervise[13] from D. J. Bernstein's daemontools[14] replaces the usual init.d with a service directory, so that things are installed, stopped and started and otherwise managed in a non-standard way.    However, as supervise is part of this whole process, service monitoring and restarting comes along for free.  When a daemon is started it becomes a child of supervise.  When a daemon dies, supervise is notified and it is automatically respawned immediately without waiting for a poll interval.  Supervise itself does not appear to be monitored for failures by svscan, which appears to be unmonitored.  Daemontools does not appear to support service dependencies.  Like cl_respawn, when you invoke a process with daemontools it is necessary to make sure it does not fork off and run in the background.
  • Watchdog drivers can either be implemented in software like softdog[15]  or they can be implemented in hardware.  All the watchdog drivers that ship with Linux share the same API - that implemented by softdog[15].  As a result, a program which works with softdog can work quite nicely with the hardware-based watchdog drivers.  It's worth noting that kernel hangs which completely disable kernel timer services can cause softdog to malfunction. The good news that hangs like this are extremely rare.   If one has a hardware based watchdog device, it takes a hardware failure for these to fail to reset a hung machine.  This too, is extremely rare.  If one is a true belt-and-suspenders sysadmin (with a enough time and budget), one could use both a software and hardware based watchdog
  • ldirectord[16] is a specialized monitoring system which is delivered as a subpackage of the Linux-HA suite.  It monitors servers and services in a load balancing cluster and takes dead servers out of the load balancing rotation, and restarts services when they stop responding.  It is a type 2 service, and because it's an HA resource, is normally monitored by Heartbeat itself.

See Also
Howtoforge articles on monitoring[17].

Related systems - not quite on topic

Ganglia[5] is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state.

Although it doesn't attempt to repair problems, it is widely used in high-performance clusters and might be of interest to people reading this post.

Opennms[6] is a Java-based enterprise grade open source network monitoring platform. It consists of a community supported open-source project as well as a commercial services, training and support organization.

The goal is for OpenNMS to be a truly distributed, scalable platform for all aspects of the FCAPS network management model, and to make this platform available to both open source and commercial applications.

OpenNMS is oriented towards responding to SNMP data and traps, but can be modified to monitor other things.  It does not appear to me to be well-suited to recovering from a service or process hanging, nor for monitoring a single system.  Where it shines is in managing an entire data center with a large number of SNMP-enabled servers and network devices.

References

[1]  http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2]  http://linux-ha.org/RelatedTechnologies/MonitoringSoftware
[3]  http://www.tildeslash.com/monit/
[4]  http://mon.wiki.kernel.org/index.php/Main_Page
[5]  http://ganglia.info/
[6]  http://www.opennms.org/
[7]  http://www.nagios.org/
[8]  http://www.gp-net.nl/2007/05/12/nagios-remote-eventhandling-for-init-scripts/en/
[9]  http://linux.die.net/man/8/apphbd
[10] http://linux-ha.org/
[11] http://hg.linux-ha.org/dev/raw-file/8c5da9553636/tools/1node2heartbeat
[12] http://linux-ha.org/ha.cf/AutojoinDirective
[13] http://cr.yp.to/daemontools/supervise.html
[14] http://cr.yp.to/daemontools.html
[15] http://en.wikibooks.org/wiki/Linux_Kernel_Drivers_Annotated/Character_Drivers/Softdog_Driver
[16] http://linux.die.net/man/8/ldirectord
[17] http://www.howtoforge.com/taxonomy_menu/1/59

September 15, 2007

Service Monitoring - basics of a key part of automated management

Software failures are pretty common.  If you've used software, you've seen software fail.  Typically they're among the most common types of failures, with only power failures being more common.  So, if you want your service to work reliably, something is going to have to monitor the software that implements it to see if it's working.  Software outages are estimated at around 25% of the total of unplanned outages[1].  Like most general estimates  measured against other people's computers - YMMV - Your Mileage May Vary[2] .

A sufficient condition for program triviality is that it have no bugs

So, it seems pretty obvious that this is a pretty big chunk of potential outages to watch over and try and eliminate or recover from if you can.  Of course, before you can recover from a problem you have to know there's a problem.  That's where monitoring comes in.  Monitoring software is software that monitors (or watches) other software.

Local or Network Monitoring

There are lots of packages for monitoring software available in the wild.  I tend to favor monitoring software that runs on the machine being monitored - it works even when the network is down, it minimizes the network traffic, and security concerns are greatly reduced.  But, it can be relatively painful to administer since you have to install and configure software on each machine, and you have to separately monitor the servers.  If you have thousands of servers you're monitoring in this way, it's potentially very painful (at least without the right tools for managing it centrally).  Since I like clusters, I'll mention that if you're running a cluster, this monitoring can be administered centrally on the cluster, making that much simpler than it would otherwise be - in fact it may come "for free" with when you configure the services in the cluster.

How to Monitor Services?

You have software on your server which implements services.  So, how are you going to monitor it?   There are basically two different ways to monitor software externally - the easy way, and the not-so-easy way.

  1. A really simple way to monitor a service is to just look to see if it's still running.  If you're on a UNIX-like machine, you can do a ps to see if it's still running, and if it is, say "Good service!", and if it's not, say "Bad service - Go to your room!".  If the majority of your software failures result in the process exiting (leaving a core or not), then this will work just fine for you.  This is how most init script status actions work.  However, if the software hangs, or gives crazy or otherwise incorrect results, this method won't detect that kind of misbehavior for you - you'll need a less-simple method.  It's also worth noting that, simple as this is, you have to be on the machine that's running the service.  You can't directly see the process table of a remote machine.  This technique is the basis for how the UNIX respawn init directive works - when a service exits it gets automatically restarted.  Simple, but usually too simple - which is why it only sees limited use in practice.  You could argue that good HA systems spend half of their time implementing the respawn directive right ;-).
  2. A not-so-simple-way to monitor a service is to use the service and see if the result you got was reasonable, and completed in a reasonable time.  For example, if you are monitoring a web server, you might do an HTTP GET operation on the port the server is on, and see if you get reasonable-looking HTML and a non-error return code in a "reasonable" time.  For example, the Linux-HA[3] apache[4] resource agent does exactly this using the wget[5] command.  If you have a database, you might do a short query whose answer is easily sanity-checked.  This is what the Linux-HA[3] DB2[6] resource agent does (as do several other database resource agents).  Another advantage of this technique is that you can often perform it remotely - which works out well if you want to monitor everything centrally.

Is there some kind of standard for monitoring UNIX-like services?

In fact, there are a couple of them and Linux-HA[3] implements them both.  These two standards are the LSB (Linux Standard Base[7]) and the OCF (Open Cluster Framework[8]) standards.

  • The Linux Standard Base defines a standard for init scripts[9] which tells how to start, stop, and determine the status of a service.  It's the status part that we care about here - because this implements a poor-man's simple monitoring operation.
  • The Open Cluster Framework[8] defines a standard for resource agents[10] (RAs) which tells how to start, stop, and (lucky for us) monitor a service.  In fact, the monitor action is interesting, because it defines multiple levels of monitoring.  Maybe you want to do something lightweight frequently, and something heavier-weight less often.  If so, then the OCF RA standard may be for you.  For the interested, it's simple to convert an init script into an OCF resource agent - since the OCF RA specification was built on top of the LSB init specification.

Both of these standards only work for monitoring services locally.  That is, you have to be on the server to use either.  So, you can't use them if you want to monitor services remotely.

Quis custodiet ipsos custodes?  - Who will watch the watchers?

If your monitoring software fails, how would you know?  Since it's software, and like all software, might fail, who's monitoring it to make sure it works?  As you can see, this is a potentially endless problem.  In my experience, no one has any endless solutions - only endless problems.  But, there is a reasonable way to approach it - create a hierarchy of watchers.

Applications (which can also watch themselves) are watched by a watcher program, which in turn registers with a watchdog driver.  In Linux, that can be either the softdog[11] watchdog driver or a hardware watchdog driver.  To use a watchdog driver, the watching program has to check in periodically and tickle the watchdog driver periodically.  If it doesn't, then the system reboots.  Good reason to make the top level watcher as simple as possible.  How to tickle a watchdog optimally is good fodder for a future post.

Unfortunately, the standard Linux softdog driver only allows one program at a time to use it.  Unless you want to restrict yourself to only one watcher on a system, you have to have some kind of watchdog program that the watchers register with.  The Linux-HA project provides the apphbd[12] (application heartbeat daemon) designed expressly for this purpose.  So, you register with apphbd, and apphbd registers with the watchdog driver - which watches it.  Apphbd can watch as many programs as you want - but they have to register with it, and they have to tickle it (send it heartbeats) periodically.  That means your application has to be modified to do it - which is different from the two earlier approaches.  Just like the watchdog driver, you have to tickle apphbd periodically, or it notifies someone who will take a recovery action.  Fortunately, however, apphbd doesn't reboot the machine - normally the application just gets restarted if it dies or stops sending heartbeats.  Avoiding unnecessary reboots is generally thought to be a good thing  ;-) - so most people like this a lot.

In the process of talking about how to watch watchers, we've added a third method of monitoring applications.  The first two could be done with any application, and the third only by applications you've modified.  To summarize, they are:

  1. Checking to see if the process is running (local only)
  2. Exercising an application API to check for application sanity (local or remote)
  3. "Tickling" a watchdog program or driver (typically local) - requires modifying the application.

But what if the kernel fails, or the driver fails, or the watchdog hardware fails, what do you do then?   Really there's only one thing to do - that's let another machine watch this one - and give it the ability to kill your machine using something like STONITH[13].  Voila!  You've just made an HA cluster.  How does that monitoring machine get monitored?  You let the cluster members monitor each other - resulting in a sort of mutual monitoring society.  Can this fail?  Of course! - given enough problems anything can fail - but it's really unlikely.  Trying to make this kind of problem go away completely helps you discover that paranoia can be a very expensive hobby.

This idea of hierarchical watchers is generally a good pattern to follow for monitoring.  For example, have each machine monitor its own services, and monitor the machines in a cluster, then monitor the clusters centrally.  This minimizes network traffic, and is scalable to very large environments.

I've covered the basic concepts of service monitoring, without quite getting around to talking about how to implement it in practice.  This will be covered in a future posting.

References
[1] Gregory Pfister - In Search Of Clusters - Second Edition. p. 390, 1998, Prentice Hall
[2] http://en.wiktionary.org/wiki/your_mileage_may_vary
[3] http://linux-ha.org/
[4] http://httpd.apache.org/
[5] http://www.gnu.org/software/wget/
[6] http://www.ibm.com/db2
[7] http://www.linuxbase.org/
[8] http://opencf.org/
[9] http://www.linuxbase.org/spec/refspecs/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
[10] http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD
[11] http://en.wikibooks.org/wiki/Linux_Kernel_Drivers_Annotated/Character_Drivers/Softdog_Driver
[12] http://pwet.fr/man/linux/administration_systeme/apphbd
[13] http://linux-ha.org/STONITH or http://en.wikipedia.org/wiki/STONITH

September 12, 2007

Requested / Suggested Topics - updated

I've been thinking about what kind of content would be useful here.  Those of you who know me know that I'm happy to talk about almost anything.  But not all things are useful, and there are more useful things to talk about than I have time to talk about.  So, here are some of my thoughts from recent conversations with people and reading Russell Coker's HA blog postings[1], and a request[2] from Emily Ratliff's blog.

  • Finish the series on Disaster Recovery
  • Talk about the capabilities of the policies you can express in Linux-HA's[3] CIB
  • How repeated failures accumulate in failure stickiness
  • The importance of civility and friendliness in projects - how Linux-HA[3] has succeeded and failed in this way, and why
  • Fencing (STONITH[4]) discussion
  • Quorum and the relationship to STONITH (fencing)
  • HA and virtualization - HA as virtualization, virtualization as HA[5].
  • Linux-HA[3] capability tour
  • HA/DR training and certification
  • Single System Image - do you need it?
  • Tracking Linux-HA installs - how many people use it and how do you figure it out?

Please chime in with your thoughts and suggestions on which of these (or other) topics interest you.

PS:  Comments should be turned on for the blog now.  Sorry they weren't on at first for my previous posting.

Links:
[1] http://etbe.coker.com.au/category/ha/
[2] http://www.ratliff.net/blog/index.php/2007/09/10/alan-robertsons-new-blog-on-managing-computers/
[3] http://linux-ha.org/
[4] http://linux-ha.org/STONITH
[5] http://www.byteandswitch.com/document.asp?doc_id=133643&WT.svl=news1_3

 

September 07, 2007

Automated Disaster Recovery - Data Replication

This post covers the basics of data replication in automated disaster recovery.  Automated disaster recovery means that when a site providing a service goes down, the service will continue running elsewhere without human intervention.  If you're willing to trust your automation, this is an ideal situation - one site goes down (for whatever reason) and the other one takes over - automatically - and pretty quickly.  In most cases, your users get continued access to the service.  The Linux-HA software is well-prepared to do this for you correctly - even when your cluster is split across sites.  It has specific technology in it to deal with this situation - which I'll explain in a later post.

In Disaster recovery terminology, this kind of a solution provides meets both excellent recovery point objectives (RPO), and recovery time objectives (RTO).

There are a few problems with this though -- the second site has to have enough servers to be able to run the service, and of course, the software to run the service, and - the most difficult part - it has to have an up-to-date copy of the data the service uses.  This post will concentrate on this aspect of automated disaster recovery.

There are two basic approaches to replicating data across sites - application-specific, or disk volume level replication.

Application-specific Replication
A number of applications (DB2 UDB, Oracle, DHCP, DNS, etc.) have built-in (or add-on) replication mechanisms which will do a great job of keeping two copies of your data in sync - and these methods are typically the best when they're available.  However, if your application doesn't have such a method, then you'll have to use disk-volume-level replication.

Disk-volume-level replication
If you're running Linux, a good general way of replicating dynamic data copied across sites is DRBD.  DRBD is a great open source tool which keeps two copies of data in sync, and worries a lot about data integrity -- which copy of the data is up to date, and related things.  It works most efficiently when you make one side master and update from only the master side.  Typically, in a split-site design, this is typically what's needed - both for efficiency and keeping both your sanity and your data's sanity.

DRBD is only one method of doing this.  Most storage vendors (IBM for example) sell products that can replicate storage volumes across distances.  There are also a number of other commercial replication packages, and a few other open source disk replication packages.

If you are going to do automated site failover, then it's highly recommended that you replicate your data synchronously rather than asynchronously.

Synchronous replication means that before a disk write (or database transaction) is reported as being complete, that the replication software makes sure the other side has a good copy of the data on disk.  That way, no writes (or transactions) are ever reported as complete, but somehow lost after a failover.

With asynchronous replication,  the replication software ensures only that the write (or transaction) is queued to send to the other machine before the write or transaction is reported as being complete.  This means that in case of a crash, the last few writes or transactions may be lost - which may compromise your recovery point objective (RPO)..  If you fail over to the backup site, those last few transactions may be lost permanently.  This is obviously a disadvantage.  If you failed over automatically due to a communications link being down for a few minutes, those transactions were lost for what may turn out to be very little reason.  This is the kind of thing that has been known to encourage people to make out their resumes unexpectedly.

So, with asynchronous replication having rather serious limitations, why do people use it?  It's simple really - the speed of light.  If your two sites are 1000 km apart, the round trip time to send a packet, write the data, and confirm it is greater than the speed of light in fiber.   Normally people think of the speed of light as 300,000 km/sec.  However, in fibre, the speed of light is more like 195000 km/sec.  If you want your transactions to complete with no more than 1ms delay due to network delays, then you need your round trip ping time to be less than 1 ms.   This means that in the worst case, the round trip distance can't be more than 195 km (or 97 km each way).  However, with disk delays, and delays induced by store-and-forward algorithms in networking hardware, the practical rule-of-thumb distance is more like 50 km.  But, of course, this depends on how write-intensive your application is, and how much delay you can tolerate.  I'm not an expert in this area, but perhaps someone reading this can offer their opinions on it.  You probably figured out pretty much right away, that you can't get this kind of latency over the public Internet - there are just too many hops.  This means for synchronous replication that you need a dedicated (and obviously low-latency) network between your sites.

To summarize - for the things we've talked about so far, you need:

  • Two (or more) sites
  • Servers on both sites
  • Replication software or hardware
  • Duplicate storage on both sites
  • Dedicated low-latency bandwidth to connect the two sites

Of course, this isn't nearly enough for a complete solution, but it covers most of the basics for replication.

Another resource worth looking at is Christoph Miasch's thesis on Server-Based Wide Area Data Replication for Disaster Recovery