Many pundits, and others such as VMware's CEO Diane Greene, have touted[1] virtualization as the "cure" for disaster recovery, some of them for several years now. Disaster recovery can reasonably be viewed as high availability over distance, so it makes sense to look at how DR, HA and virtualization fit together. What's hype here, and what's real? Let's take a look.
What is virtualization?
Wikipedia[2] defines virtualization like this:
In computing, virtualization is a broad term that refers to the abstraction of computer resources. One useful definition is "a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications or end users interact with those resources".
At the moment, this term is most commonly used to refer to machine or OS-level virtualization, storage virtualization and something I'll call container virtualization. For now, let's ignore storage virtualization. Machine-level virtualization basically means making it difficult to tell exactly which physical machine an OS is running on at any given point in time. There are many other kinds of virtualization as well - service virtualization, for example. Service virtualization doesn't hide or abstract physical machines, but instead virtualizes or hides services. That is, there is no fixed binding between services and physical machines. This is, in fact, exactly what classic HA software like Linux-HA[3] does. An HA system uses service virtualization to recover from server failures, and monitors services so that it can restart them - either locally or on another server.
Customers care about services, not about OSes or physical machines, so from their point of view, machine and service virtualization are similar. Both provide the potential for recovering from failures of physical servers and OSes. However, since machine virtualization doesn't have any visibility of the individual services and the dependencies between them, it can't monitor them and restart them if they fail.
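To make the difference concrete, here is a minimal sketch of one pass of the kind of monitor-and-recover logic a service-level HA system performs. The Service class, the recover function and the node names are hypothetical and exist only for illustration; they are not the Linux-HA API, and a real cluster manager also deals with fencing, quorum and dependencies.

```python
class Service:
    """Hypothetical service exposing the classic HA operations."""

    def __init__(self, name):
        self.name = name
        self.node = None      # node the service currently runs on
        self.healthy = False  # stands in for a real health check

    def start(self, node):
        self.node = node
        self.healthy = True

    def stop(self):
        self.node = None
        self.healthy = False

    def monitor(self):
        # A real monitor would probe the service (HTTP request, SQL query...).
        return self.node is not None and self.healthy


def recover(service, nodes_up):
    """Run one monitoring pass and return a description of the action taken."""
    if service.node in nodes_up:
        if service.monitor():
            return "ok"
        # The node is fine but the service failed: restart it in place.
        node = service.node
        service.stop()
        service.start(node)
        return "restarted locally"
    if not nodes_up:
        return "no node available"
    # The node itself is gone: start the service on a surviving node.
    target = sorted(nodes_up)[0]
    service.stop()
    service.start(target)
    return f"failed over to {target}"
```

A machine-level manager sees only the equivalent of the last branch: it can notice that a whole node or virtual machine is gone, but it has no monitor() for the individual services inside it.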
However, virtualization needs management entities to oversee the virtual entities, decide when to move them, and coordinate their movement from one place to another. VMware will soon have their Site Recovery Manager[4], and Linux has Linux-HA[3] (among other solutions). The purpose of such management entities is to detect and respond to failures and administrative requests by starting, stopping, and migrating the virtual entities in their purview across different sets of physical servers.
What does server (or OS-level) virtualization bring to the table here?
If HA systems produce a result which so far sounds similar or even superior, from the customer's availability perspective, to what OS-level virtualization provides, then why should I care? The answer is not found in responses to failures, but in responses to administrative events. If you want to move a running virtual machine from physical machine A to physical machine B, many server virtualization techniques (including Xen[10] and VMware[11]) provide a new tool - transparent migration. In transparent migration, the virtual machine image is migrated from physical machine A to physical machine B while it continues processing work without appearing to stop. This is a pretty cool trick, since you can then migrate virtual machines with apparently zero downtime. Sadly, you can't migrate a crashed OS or an OS from a crashed server in this way. As a result, it doesn't help at all with unplanned outages. However, it makes it possible to take a planned outage with minimal impact. In a practical sense, if your server is running full-throttle and writing to all the pages in memory very rapidly, the migration will likely impact performance. So, it's best not to migrate virtual machines during peak load without good reason.
Can conventional HA software manage transparent migration?
The simple answer is "mostly no". Most HA software performs only three basic operations on its resources: start, stop and monitor. If you map these operations onto virtual machines as resources, start means boot the OS, stop means shut it down, and monitor means check whether it appears to be running correctly. Note that each of these operates on only one machine at a time. A migration like the one described above becomes: stop the virtual machine on machine A, then start it on machine B. This is hardly transparent, since the stop and start operations create an outage lasting anywhere from two minutes to 30 minutes or more, and all transient application state is lost.
The problem is in the HA software's model of resources (or services). The operations it performs have the form OP(resource, node) for any given OP such as start, stop or monitor. What's needed instead is an operation model which permits OP(resource, fromnode, tonode). In addition, once dependencies are added, a migratable resource can't depend on any non-migratable resource. (It's actually a little more complicated than this, but it would take too much time to explain all the details in this post.)
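The two operation models can be contrasted with a small sketch. Everything in it (the Resource and Cluster classes, the migratable flag, the move helper) is hypothetical and invented for illustration, not taken from any real HA package, and the dependency check implements only the simplified rule stated above.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    migratable: bool = False
    depends_on: list = field(default_factory=list)  # other Resource objects


class Cluster:
    """Toy cluster that records the operations it would perform."""

    def __init__(self):
        self.location = {}  # resource name -> node it runs on
        self.log = []       # operations in the order they were issued

    # Classic model: OP(resource, node)
    def start(self, res, node):
        self.location[res.name] = node
        self.log.append(("start", res.name, node))

    def stop(self, res, node):
        self.location.pop(res.name, None)
        self.log.append(("stop", res.name, node))

    # Extended model: OP(resource, fromnode, tonode)
    def migrate(self, res, fromnode, tonode):
        self.location[res.name] = tonode
        self.log.append(("migrate", res.name, fromnode, tonode))

    def can_migrate(self, res):
        # Simplified rule: the resource and everything it depends on
        # must be migratable.
        return res.migratable and all(self.can_migrate(d) for d in res.depends_on)

    def move(self, res, tonode):
        fromnode = self.location[res.name]
        if self.can_migrate(res):
            self.migrate(res, fromnode, tonode)  # no outage
        else:
            self.stop(res, fromnode)             # outage starts here
            self.start(res, tonode)
```

With only start and stop available, move always degrades to the second branch. The single migrate operation that names both nodes is what lets the manager hand the work to the hypervisor's transparent migration.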
Are there any HA packages that can model transparent migration?
Yes. Sorry to sound a bit like a broken record[13], but Linux-HA[3] can do that very nicely - and to my knowledge it is the only HA system which can. It includes support for migratable resources like virtual machines and container virtualization, and handles the complexities of dependencies alluded to above, along with other potential problems not mentioned here. So, if you use Linux-HA to manage your virtual machines, you get full use of transparent migration without having to figure out some monumental kludge to shoehorn something kind of like transparent migration into your HA package. It just works. It also just works for containers, and for resources which support checkpoint/restart. All in all, a nice feature I'd say.
What does transparent migration have to do with disaster recovery?
Nothing. Well... almost nothing. Let me explain. Transparent migration only helps move services while the machines are up, so it's not normally very useful in a disaster - presumably a fire, flood, power failure, building collapse or whatever has rather rudely and typically unexpectedly interrupted service, precluding migration. Why did I say almost nothing above? Because some kinds of disasters (hurricanes and floods, for example) often give warning in advance, so you can use transparent migration for those cases. The other reason is that you can test your disaster recovery without any service outages. Or rather, you can partially test your disaster recovery. If any resources (disk drives, for example) are needed only during boot, you probably won't notice that they are missing when you migrate over and back. So it is still desirable to periodically do non-transparent migrations of your services to make sure everything still works after the latest configuration changes. Still, you can migrate over and back every weekend, and perform non-transparent migrations only once a month or once a quarter, just to make sure all your bases are covered.
Can't you do this with an HA cluster split across sites?
Yes. You can do it with only service virtualization and a conventional HA system, or you can do it with Linux-HA[3] and either service or server virtualization. In either case, you have to prepare for the replication of data[5] across sites. Making sure all the data can be accessed from multiple physical machines is part of the baseline discipline imposed by both virtual machine failover and service failover, and it is the starting place for any kind of disaster recovery. This discipline of replicating data and OSes is valuable, but since it's not unique to virtual machines, this part is more hype than reality. There's nothing special about virtual machines in this respect. Any differences that do exist come not from virtualization per se, but from how easy the management tools are to use correctly, and how comprehensive they are.
Advantages of service virtualization
Monitors individual services [6] [7] - can detect and recover from software errors that don't cause OS crashes
Can recover individual services (more fine-grained)
Can recover servers and services without time for a reboot (faster)
Advantages of server virtualization
Transparent migration for administrative events is outage-free
All virtual machines more or less look alike (simplicity)
Best of Both Worlds?
Linux-HA[3] can model virtual machines as resources, and it can model normal services (web server, etc) as resources. Can it do both at the same time? Well... Yes and No... Doing this would require that you have both virtual machines as resources and as cluster members at the same time. Although you could do this, it might have unpleasant side effects. However, you could have a cluster of virtual machines and a cluster of real machines - and that would work - but would be two clusters to manage. To truly get the best of both worlds, you'd have to implement our container resource proposal[8] or something like it. Of course, since Linux-HA is an open source project, we'd love to see a patch to implement it ;-).
See Also
Virtualization Blog[9]
Dabcc virtualization and server management web site[12]
References
[1] http://www.byteandswitch.com/document.asp?doc_id=133643
[2] http://en.wikipedia.org/wiki/Virtualization
[3] http://linux-ha.org/
[4] http://www.vmware.com/company/news/releases/srm.html
[5] http://techthoughts.typepad.com/managing_computers/2007/09/automated-disas.html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[7] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html
[8] http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1417
[9] http://www.virtualization.info/
[10] http://www.xensource.com/
[11] http://www.vmware.com/
[12] http://www.dabcc.com/
[13] http://en.wikipedia.org/wiki/Gramophone_record
RHAT's Cluster Suite also supports the Xen migration feature.
Posted by: Lars Marowsky-Bree | 11 October 2007 at 08:11
>>Sadly, you can't migrate a crashed OS or an OS from a crashed server in this way. As a result, it doesn't help at all with unplanned outages.<<
In fact, VMware VMotion can (semi)transparently fail-over a virtual machine that is running on ServerA to ServerB if ServerA has an outage. This is true high availability.
Posted by: Baz | 11 March 2008 at 19:49
What you said was nothing like what you replied to. You can _reboot_ a virtual machine, or restore it from a checkpoint. But, you cannot transparently migrate a machine which has died, since transparent migration (VMware's name: vmotion) requires active cooperation of the now-dead machine.
Anyone can reboot a virtual machine - that's no trick at all. And, restoring from a snapshot - that's not interesting for most enterprise computing cases.
Posted by: Alan R. | 12 March 2008 at 14:51