Many pundits, and other folks such as VMware's CEO Diane Greene, have
touted[1]
virtualization as the "cure" for disaster recovery,
some of them for several years now. Disaster recovery can
reasonably be viewed as high availability over distance, so
it makes sense to look at how DR, HA and virtualization fit
together. What's hype here, and what's real? Let's take a look and
see what we find.
What is virtualization?
Wikipedia[2] defines virtualization like this:
In computing, virtualization is a broad term that
refers to the abstraction of computer resources. One useful
definition is "a technique for hiding the physical
characteristics of computing resources from the way in which other
systems, applications or end users interact with those resources".
At the moment, this term is most commonly used to refer to machine
or OS-level virtualization, storage virtualization and
something I'll call container virtualization. For now, let's
ignore storage virtualization. Machine-level
virtualization means basically making it difficult to tell exactly
which physical machine an OS is running on at any given point in
time. There are many other kinds of virtualization as well -
for example service virtualization. Service
virtualization doesn't hide or abstract physical machines, but
instead virtualizes or hides services. That is, there is no
fixed binding between services and physical machines. This is,
in fact, exactly what classic HA software such as Linux-HA[3] does.
An HA system uses service virtualization to recover from server
failures, and it monitors services so that it can restart them,
either locally or on another server.
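To make the monitor-and-restart idea concrete, here is a minimal sketch in Python. It is not how Linux-HA or any other product is implemented; the Node class and the start, monitor and recover functions are hypothetical names made up for illustration. It only shows the shape of the decision: if the service is unhealthy, try to restart it where it was, and otherwise start it on another server.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a physical server in the cluster."""
    name: str
    healthy: bool = True
    running: set = field(default_factory=set)


def start(service: str, node: Node) -> bool:
    """Try to start a service on a node; fails if the node is down."""
    if not node.healthy:
        return False
    node.running.add(service)
    return True


def monitor(service: str, node: Node) -> bool:
    """Report whether the service looks alive on this node."""
    return node.healthy and service in node.running


def recover(service: str, current: Node, others: list) -> Optional[Node]:
    """Restart locally if possible, otherwise fail over to another node."""
    if monitor(service, current):
        return current                      # service is fine, nothing to do
    current.running.discard(service)        # forget the failed instance
    for candidate in [current, *others]:    # prefer a local restart
        if start(service, candidate):
            return candidate
    return None                             # no node could run the service
```

Note that nothing in this sketch ties the service to a particular machine: the service ends up wherever recover manages to start it, which is the "no fixed binding" property described above.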
Customers care about services, not about OSes or physical
machines, so from their point of view, machine and service
virtualization are similar. Both provide the potential
for recovering from failures of physical servers and OSes.
However, since machine virtualization doesn't have any
visibility of the individual services and the dependencies between
them, it can't monitor them and restart them if they fail.
However, virtualization of either kind needs a management entity to
oversee the virtual entities, decide when to move them, and
coordinate their movement from one place to another.
VMware will soon have their Site Recovery Manager[4],
and Linux has Linux-HA[3] (among
other solutions). The purpose of such management entities is to
detect and respond to failures and administrative requests by
starting, stopping, and migrating virtual entities in their purview
to different sets of physical servers.
What does server (or OS-level) virtualization bring to the table here?
If HA systems
produce a result that, so far, sounds similar or even superior from the customer's perspective to what OS-level
virtualization does (at least from an availability perspective), then
why should I care? The answer is not found in responses to failures,
but in responses to administrative events. If you want to migrate a
running virtual machine from physical machine A to physical machine
B, many server virtualization techniques (including Xen[10] and
VMware) provide a new tool - transparent migration. In transparent
migration, the virtual machine image is migrated from physical
machine A to physical machine B while it continues to process work,
without appearing to stop. This is a pretty cool trick, since you
can then migrate virtual machines with apparently zero
downtime. Sadly, you can't migrate a crashed OS or an OS from a
crashed server in this way. As a result, it doesn't help at all with
unplanned outages. However, it makes it possible to take a planned
outage with very minimal impact. In a practical sense, if your
server is running full-throttle and writing on all the pages in
memory very rapidly, the migration will likely impact performance.
So, it's best not to migrate virtual machines during peak load
without a good reason.
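Why does a busy server make migration harder? Here is a toy model, assuming an iterative "pre-copy" scheme in which memory is copied while the guest keeps running and any pages dirtied in the meantime are sent again. This is one common way to do live migration, but the post doesn't depend on which scheme a given product uses, and the function name, parameters and numbers below are all made up for illustration.

```python
def precopy_rounds(total_pages, dirty_per_sec, copy_per_sec,
                   stop_threshold=50, max_rounds=30):
    """Return (rounds, pages_left) for a toy pre-copy migration.

    Each round copies the pages that are currently outstanding while the
    guest keeps running and dirtying memory. Once few enough pages remain,
    the guest would be paused briefly to copy the remainder.
    """
    remaining = total_pages
    for round_no in range(1, max_rounds + 1):
        # Pages dirtied during this round = dirty rate * time spent copying.
        # Integer arithmetic keeps the toy model exact.
        dirtied = remaining * dirty_per_sec // copy_per_sec
        remaining = min(total_pages, dirtied)
        if remaining <= stop_threshold:
            return round_no, remaining
    return max_rounds, remaining            # never converged


if __name__ == "__main__":
    print("lightly loaded:", precopy_rounds(100_000, 1_000, 10_000))
    print("full throttle: ", precopy_rounds(100_000, 12_000, 10_000))
```

In this model, when the guest dirties pages more slowly than they can be copied, the outstanding set shrinks every round and the final pause is short. When it dirties them faster than they can be copied, the set never shrinks, so the migration either drags on or has to pause the guest for a long final copy.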
Can conventional HA
software manage transparent migration?
The simple answer is "mostly no". Most HA software has only three basic
operations that it performs on its resources: Start, Stop and
Monitor. If you map these operations to operate on virtual machines
as resources, start means boot the OS, stop means shut it down, and
monitor means look and see if it appears to be running correctly.
Note that each of these also operates on only one machine at a time.
A migration like the one described above becomes: stop the virtual
machine on machine A, and then start it on machine B. This is hardly
transparent, since the stop and start operations create an outage
which can range from two minutes to 30 or more minutes, and which
completely loses all transient application state.
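Here is a minimal sketch of that mapping. The VirtualMachine class and the cold_migrate function are hypothetical, and the "transient state" is just a dictionary standing in for whatever the applications inside the VM hold in memory.

```python
class VirtualMachine:
    """Toy VM resource that supports only start, stop and monitor."""

    def __init__(self, name):
        self.name = name
        self.node = None            # where the VM is running, if anywhere
        self.transient_state = {}   # in-memory state, lost on shutdown

    def start(self, node):
        """Boot the OS on the given node (always a fresh boot)."""
        self.node = node
        self.transient_state = {}

    def stop(self, node):
        """Shut the OS down if it is running on the given node."""
        if self.node == node:
            self.node = None
            self.transient_state = {}

    def monitor(self, node):
        """Report whether the VM appears to be running on the given node."""
        return self.node == node


def cold_migrate(vm, from_node, to_node):
    """Migration as a three-operation HA manager sees it: stop, then start."""
    vm.stop(from_node)
    vm.start(to_node)
```

Each operation takes exactly one node, so the only way to express "move" is a shutdown followed by a boot, and whatever was in transient_state before cold_migrate is gone afterwards.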
The problem is in the HA software's model of resources (or
services). The operations it performs have the form
OP(resource, node) for any given OP such as
start, stop
or monitor. What's
needed instead is an operation model which permits OP(resource,
fromnode, tonode). In addition, once dependencies are added, a
migratable resource can't depend on any non-migratable resource.
(It's actually a little more complicated than this, but it would take
too much time to explain all the details in this post.)
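As a rough sketch of what the extended model looks like, here is an operation that takes both a source and a destination node, plus the simple form of the dependency rule. The Resource class, can_migrate and migrate are hypothetical names, the placement dictionary is an assumed stand-in for the cluster's view of where things run, and the extra complications mentioned above are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource description, for illustration only."""
    name: str
    migratable: bool = False
    depends_on: list = field(default_factory=list)


def can_migrate(resource):
    """A resource can migrate only if everything it depends on can too."""
    return resource.migratable and all(
        can_migrate(dep) for dep in resource.depends_on
    )


def migrate(resource, from_node, to_node, placement):
    """OP(resource, fromnode, tonode): move without a stop/start cycle.

    placement maps resource names to the node they currently run on.
    """
    if placement.get(resource.name) != from_node:
        raise ValueError(f"{resource.name} is not running on {from_node}")
    if not can_migrate(resource):
        raise ValueError(f"{resource.name} cannot be migrated transparently")
    placement[resource.name] = to_node
```

The point of the three-argument form is that the manager knows both ends of the move at once, which is what lets it ask the virtualization layer for a transparent migration instead of falling back to stop on one node and start on another.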
Are there any HA packages that can model transparent migration?
Yes. Sorry to sound a bit
like a broken record[13], but Linux-HA[3]
can do that very nicely - and to my knowledge it is the only HA
system which can. It includes support for migratable resources
like virtual machines and containers, and handles the
complexities of dependencies alluded to above, along with other
potential problems not mentioned here.
So, if you use Linux-HA to manage your virtual
machines, you get full use of transparent migration without having to
figure out some monumental kludge to shoehorn something kind of like
transparent migration into your HA package. It just works. It also
works for containers, and for resources which support
checkpoint/restart. All in all, a nice feature I'd say.
What does transparent migration have to do with disaster recovery?
Nothing. Well... Almost
nothing. Let me explain... Transparent migration only helps
move services while the machines are up, so it's not normally very
useful in a disaster, since presumably a fire, flood, power
failure, building collapse or whatever has rather rudely (and
typically unexpectedly) interrupted service, precluding migration.
Why did I say almost
nothing above? Because some kinds of disasters (hurricanes, and
floods for example) often give warning in advance - so you can
use transparent migration for those cases. The other reason is that you can test
your disaster recovery without any service outages. Or rather, you
can partially test
your disaster recovery. If there are any resources (disk drives for
example) which are only needed during boot, then if those are
missing, you probably won't notice when you migrate over and back.
So, it is still desirable to periodically do non-transparent
migrations of your services to make sure things are all still working
after the latest configuration changes. However, you can migrate
over and back every weekend, and maybe once a month or once a quarter
perform non-transparent migrations just to make sure all your bases
are covered.
Can't you do this with an HA cluster split across sites?
Yes. You can do it with only
service virtualization and a conventional HA system, or you can do it
with Linux-HA[3] and either
service or server virtualization. In either case, you have to
prepare for the replication of data[5]
across sites. Note that making sure all the data can be accessed
from multiple physical machines is part of the baseline discipline
which is imposed both by virtual machine failover and by service
failover. This discipline is a valuable one to undertake, and it is
the starting place for any kind of disaster recovery. However, since
it's not unique to virtual machines, this part is more
hype than reality. There's nothing special about virtual machines in
this respect. Any differences that exist are due not to
virtualization per se, but to how easy the management tools
are to use correctly, and how comprehensive they are.
Advantages of service virtualization
Monitors individual services[6][7] - can detect and recover from software errors that don't cause OS crashes
Can recover individual services (more fine-grained)
Can recover servers and services without time for a reboot (faster)
Advantages of server virtualization
Best of Both Worlds?
Linux-HA[3]
can model virtual machines as resources, and it can model normal
services (web server, etc) as resources. Can it do both at the same
time? Well... Yes and No... Doing this would require that you have
both virtual machines as resources and as cluster members at the same
time. Although you could do this, it might have unpleasant side
effects. However, you could have a cluster of virtual machines and a
cluster of real machines - and that would work - but would be two
clusters to manage. To truly get the best of both worlds, you'd have
to implement our container resource proposal[8]
or something like it. Of course, since Linux-HA is an open source
project, we'd love to see a patch to implement it ;-).
See Also
Virtualization Blog[9],
Dabcc virtualization and server management web site[12]
References
[1] http://www.byteandswitch.com/document.asp?doc_id=133643
[2] http://en.wikipedia.org/wiki/Virtualization
[3] http://linux-ha.org/
[4] http://www.vmware.com/company/news/releases/srm.html
[5] http://techthoughts.typepad.com/managing_computers/2007/09/automated-disas.html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[7] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html
[8] http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1417
[9] http://www.virtualization.info/
[10] http://www.xensource.com/
[11] http://www.vmware.com/
[12] http://www.dabcc.com/
[13] http://en.wikipedia.org/wiki/Gramophone_record