Traditionally, people have implemented high availability by using a high-availability management package like Linux-HA[1] and then configuring it in detail for each application, file system mount, IP address and so on. This traditional method works quite well, but can be a bit labor intensive - particularly when using custom or uncommon applications. You may have to understand the structure of your applications, write some resource agents[2], debug them, and test them in detail. In addition, every time you change your mount structure, or other details you've told your HA system about, you have to be sure to update your HA configuration to match - or it might not fail over correctly the next time.
When you have good resource agents, your HA system will also recover from application failures - by restarting applications that have failed. This is a good thing. On the other hand, this is enough work that virtually no one runs all their applications in an HA configuration. It's just too much work for most applications. I call this traditional boutique-like method "HA at retail". It works well, but it is a little costly to set up and maintain all the details just so.
With virtualization, another approach is possible, and (big surprise), I call it "HA at wholesale". In this paradigm, instead of needing to write scripts for each type of application, you just have one resource agent - one for managing a virtual machine. You also don't need to know the structure of the applications - the OS still starts them in whatever way it has been starting them all along. Wow, this sounds good - less work, fewer chances for errors! As expected, there is still no such thing as a free lunch here - you do wind up with some disadvantages.
For example, you can no longer easily detect the failure of an application. In addition, if an application fails, the only thing you can do about it is reboot the entire virtual machine. Inevitably, this takes longer than just restarting the failed application.
So, HA at wholesale has these properties:
- Simple enough that you can implement it for every machine
- Works well for hardware failures
- When coupled with hardware predictive failure analysis[3] and smart HA software, outages can sometimes be completely avoided
- Can't easily detect or recover from application failures
- The only thing you can do about any failure is reboot the virtual machine
And HA at retail has these properties:
- Complex enough that you need to limit how broadly you apply it in your environment
- Works well for hardware failures
- Can easily detect and recover from application failures
- Individual applications can easily be restarted - and don't require a reboot
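The trade-off between the two approaches can be sketched in a few lines of Python. This is only an illustration of the properties listed above, not code from Linux-HA: the `Failure` enum and both function names are made up for the example, and treating a predicted hardware failure under "retail" as an ordinary failover is an assumption.

```python
from enum import Enum


class Failure(Enum):
    APPLICATION = "application"
    HARDWARE = "hardware"
    PREDICTED_HARDWARE = "predicted hardware"


def retail_recovery(failure: Failure) -> str:
    """HA at retail: a resource agent per application watches each service."""
    if failure is Failure.APPLICATION:
        # The resource agent notices the failure and restarts just that app.
        return "restart the failed application"
    # Assumption: predicted failures are handled like any other failover.
    return "fail over the affected resources to another node"


def wholesale_recovery(failure: Failure) -> str:
    """HA at wholesale: one resource agent manages the whole virtual machine."""
    if failure is Failure.APPLICATION:
        # Hard to detect at all; if it is noticed, this is the only remedy.
        return "reboot the whole virtual machine"
    if failure is Failure.PREDICTED_HARDWARE:
        # With predictive failure analysis the outage may be avoided entirely.
        return "migrate the virtual machine before the hardware fails"
    return "restart the virtual machine on another host"


if __name__ == "__main__":
    for failure in Failure:
        print(f"{failure.value}: retail -> {retail_recovery(failure)}; "
              f"wholesale -> {wholesale_recovery(failure)}")
```

The asymmetry shows up in the application-failure branch: retail can restart a single application, while wholesale can only reboot the whole virtual machine.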
[1] http://linux-ha.org/
[2] http://linux-ha.org/ResourceAgent
[3] http://www-05.ibm.com/hu/termekismertetok/xseries/dn/pfa.pdf
By "When coupled with hardware predictive failure analysis[3] and smart HA software, outages can sometimes be completely avoided" do you mean having the virtual machine migrate away from hardware that is likely to fail? Is this going to be easy to do? Last time I checked, your HA system only supported stopping and starting services, with no direct migration.
Posted by: Russell Coker | 24 October 2008 at 15:04
You understood perfectly.
If you tell Heartbeat that a given resource supports a "migrate" action, it will use that action in preference to stop/start to migrate that resource - if it can.
So, you reserve an attribute for system health and use it in your rules, monitor system health, and set the attribute accordingly. Then, like magic, things migrate away when the health is thought to be bad.
It would be nice to have a built-in series of attributes like #health-* and so on that were automatically taken into account for all resources. That would make it even easier.
This is not tied in to any particular virtualization technology, or to virtualization at all. It could be used with checkpoint/restart in an application, for example. It just has to behave like a migrate operation. Our current Xen and OpenVZ resources support it - but it's not magic. If your resource supports it, you can write an RA that supports it, and we will use it - when configured correctly.
Of course, you can't migrate a crashed resource, or migrate away from a dead machine, and there are some sensible dependency restrictions as well. [If a migratable resource depends on a non-migratable, non-clonable resource, then it can't be migrated.]
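The decision described above might be sketched roughly like this in Python. It is a toy model, not Heartbeat's actual code or configuration syntax: the `Resource` class, its fields, and the `node_healthy` flag (standing in for the reserved health attribute) are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """Hypothetical stand-in for a managed resource."""
    name: str
    supports_migrate: bool = False   # the RA implements a "migrate" action
    clonable: bool = False
    running: bool = True             # a crashed resource cannot be migrated
    depends_on: List["Resource"] = field(default_factory=list)


def can_migrate(resource: Resource, node_alive: bool) -> bool:
    """Apply the restrictions: live node, running resource, movable deps."""
    if not node_alive or not resource.running:
        return False
    if not resource.supports_migrate:
        return False
    # A dependency that is neither migratable nor clonable pins the resource.
    return all(dep.supports_migrate or dep.clonable
               for dep in resource.depends_on)


def choose_action(resource: Resource, node_healthy: bool,
                  node_alive: bool = True) -> str:
    """Prefer migrate over stop/start once the node's health looks bad."""
    if node_healthy and node_alive:
        return "stay"
    if can_migrate(resource, node_alive):
        return "migrate"
    return "stop/start"


if __name__ == "__main__":
    vm = Resource("vm1", supports_migrate=True)
    print(choose_action(vm, node_healthy=False))
```

In this model, a migratable virtual machine on a node flagged as unhealthy is migrated; the same machine with a dependency on a non-migratable, non-clonable resource falls back to stop/start.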
Posted by: Alan R. | 24 October 2008 at 17:05
It's worth noting that Pacemaker (a child project of Linux-HA - formerly called the Linux-HA CRM) does implement this convenient type of health monitoring - that applies to every resource on the machine.
Posted by: Alan R. | 30 April 2010 at 11:32