
12 March 2008




In general, snapshotting virtual servers is a good thing - however, I agree it's NOT an alternative to clustering. As you say, you lose data if you have to roll back to the snapshot. But HA is always a concept and not a product. Combine snapshots with database archive log shipping and you have a poor man's DataGuard - a good thing (of course DG is better, but the licensing - ouch).

Also, servers that would normally NOT be covered by anything that increases availability can be restored faster with snapshots and virtualization (same hardware layer everywhere) - even in a different data center 100 miles away (btw, I guess this will become more and more important when catastrophic natural disasters happen more often).

A snapshot can be the base for a restore - and for non-redundant systems, that's more than nothing. I admit, snapshots are not an alternative to well-done system design, and this may include clustering, which must also be done well. I think it's dangerous of the big virtualization companies to let people think "use virtualization - HA included". You have to look at the IT environment as a whole, and I must say SOA is a good example of this - chosen wisely, Alan :).

So the bottom line is - there's a rich toolset for increasing your availability - you have to look at your problem and choose the right tool from it (the book "Blueprints for High Availability" is a good start here).

Saying snapshots are bad (or at least implying it between the lines) is not a good policy, and is basically the same thing the virtualization companies do - but that's just my opinion.

-- Robert

Alan R.

Snapshots are almost always worse than rebooting. You don't appear to give any counterexamples - except for something that sounds a bit like "it's a backup copy of the OS". If you know of any counterexamples, I'd like to know of them (in the normal realm of "enterprise computing"). I guess I kind of assumed that if you knew how to use virtual machine snapshots, you know how to use backup tools ;-).

I've written earlier about virtualization and HA. I guess I even forgot to put any categories on this one. I was just feeling especially ornery and contrary and just wanted to get this out while it was bugging me. I'll go back and put categories on it. And, although I might have been a little more moderate in my words today, I haven't changed my mind - and it is just my opinion ;-).

Chuck Schulz

We need to be careful of being overly critical of technologies when we discover they don't inherently implement some solution we are interested in implementing. Especially when the technology was developed based on requirements which didn't have that solution in mind. Such is the case of the snapshot capabilities we see in the current population of virtualization solutions. Snapshots were developed so that the user could restore a VM to its state at some previous point in time. This, by the way, was conceived of primarily for the benefit of the development and testing communities, and for these purposes snapshots have been very effective. Is this HA? No! Does it make snapshots an HA solution? No! Are there people trying to implement HA solutions using snapshot technology? Yes! Are these solutions based solely on snapshots? No! They augment snapshots with other features to achieve the intended operational capability. Will they be successful? Well, the jury is still out on that, but it is reasonable to assume that at a minimum some will succeed, at least in a limited market such as areas which don't need and can't afford more traditional HA solutions (the kind that would, for example, support transaction processing for online banking).
If the intent was to criticize the wild enthusiasm that a hypervisor with snapshots will be the panacea for HA, that is justified. It clearly isn't for the reasons cited and many others. But snapshots are a technology which solves some of the issues when implementing certain important classes of HA. And the snapshot technology (like everything in the computer industry today) will continue to evolve as it struggles to meet new demands. It is naive to assume that it will just go away when it encounters the first bump in the road to HA.


Of course, it is always best to have all the data from all your machines available anywhere in the world without any loss of transactions - but that's not the real world, Neo :)

Sometimes it's better to have a snapshot that just works, because it's the same virtual hardware on any physical hardware in the world (not bulletproof, I agree), even with old data, than a badly tested backup strategy that isn't disaster-recovery proof. A snapshot means: "ok, at least I have some data... let's look for a way to restore the rest". When you have a snapshot available, you probably have a backup agent installed already, and you probably have most of the users and privileges set right - something to start from, isn't it?

So a snapshot is something to start from to restore your enterprise realm.

Why not use backup tools and strategies for disaster recovery? Because there are so many, and nobody can handle them all, especially in enterprise environments with lots of legacy OS installations.

Virtualization with its unified hardware layer can achieve true disaster recovery - a DR that's manageable and not just on paper. Of course - once again - storage replication with the latest, transaction-safe data is the best - but still, snapshots are better than having "nothing" when your primary datacenter has burned down, or a backup that is 24 hours old (which is just an even older, unreliable snapshot, isn't it?)

-- Robert

Alan R.

Hi Robert,

What I hear you saying is that you should back up your servers and their data. That's _always_ a good thing! You happen to think that virtual machine snapshots are better than those made by dedicated backup tools - which is too general a statement for me to comment on (which VM snapshot tool, which backup tool?). I certainly agree that some backups are better than none - but if you're relying on your snapshots for backups, then you've captured only a little of your data - not that from your fileservers (the best of which are dedicated, not running on virtual machines), and your database backups are probably corrupt. As you know, the real world is that databases need dedicated backup tools to get a consistent dump, and almost no one runs virtual machines for file servers. And those two things constitute the majority of key data for enterprises.

And my rant against snapshots is not against using them as a form of backup, but thinking (as some do) that you should restart from the snapshot for _failover_. The way you're talking about using them is as just another backup tool (albeit slow, probably not connected to tape archive systems, or integrated with catalogs like real backup systems). But, hey, if you've screwed up, and you haven't backed things up, and you did snapshot your machines, then _by all means_ use them.

But the impediment to making one kind of backup system or another work well is less the technology, and more the willingness to deploy it and test the results frequently. That's more or less the same in either case. And, you're right, in the real world, people do forget to test their backup systems. I don't see anything about the technology that makes me think that somehow snapshot technology is going to make the staff of such an IT center suddenly more diligent (or competent) in their jobs. If I've missed the boat here, I'm confident you'll let me know ;-).

And, certainly, not needing exactly the same hardware at the backup site as at the primary site is a boon to disaster recovery. I'm not against virtualization, just against some misunderstandings of the scope of its benefits (some seem to think it's good for whatever ails you).

Thanks for taking the time to write your comment!

FWIW: I spent 10 years managing computers in data centers, and for several of those years, backups were my bailiwick - so I do know it can be done well, and I know it's much harder to do right than most people have any idea of.

Alan R.

I suppose I should have said what I thought was a good approach to consider, rather than just what I didn't like. But, that wouldn't have been as quick to write - and I have been off posting for a while and wanted to get back to it and this post just had to come out.

If you're going to have two sites connected by continuous bandwidth, like you are probably doing for snapshots, then I'd recommend a product which is meant to do this job, not using snapshots for backups. Snapshots are much more powerful than backups in their own way, but they are not backups, nor are they a complete replacement for backups. Backing up terabytes of data (nowadays meaning a handful of disk drives) as "full backups", which is what snapshots do, isn't a really good plan. Backup software packages have many better ways to do this. This kind of backup you can't easily test with live data - they're always out of date.
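To put rough numbers on the "full backup" point, here is a minimal back-of-the-envelope sketch. The data size, link speeds and the 80% link efficiency are all illustrative assumptions, not figures from this discussion, and `full_copy_hours` is a hypothetical helper name.

```python
def full_copy_hours(data_tb: float, link_mbit_per_s: float,
                    efficiency: float = 0.8) -> float:
    """Hours needed to ship one full copy of `data_tb` terabytes.

    Assumes decimal terabytes and that only `efficiency` of the nominal
    link rate is usable for payload (an assumed figure).
    """
    data_bits = data_tb * 1e12 * 8
    usable_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    return data_bits / usable_bits_per_s / 3600


if __name__ == "__main__":
    # Hypothetical inter-site link speeds in Mbit/s.
    for link in (100, 1000, 10000):
        hours = full_copy_hours(2.0, link)
        print(f"{link:>6} Mbit/s: {hours:6.1f} h for a 2 TB full copy")
```

Under these assumptions a 2 TB full copy over a 100 Mbit/s link takes more than two days, which is why shipping complete images on every cycle scales poorly compared with sending only the changes.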

If you want to enter into the realm of "continuous backups" (low-loss RPO, small RTO), you can use free software packages like DRBD to maintain real-time (over short distances) or near-real-time (over long distances) continuous backups of your data. And, this you really _can_ test - even live. If you combine it with something like Linux-HA (which can manage split-site configurations) then you can fail your servers and services back and forth over the site boundary and make sure this really does work. You can schedule it every weekend (or every quarter) so that it happens automatically. And, if you're running Xen or another Linux-based system as your virtualization manager, it doesn't matter if your guest OS is something like Windows. If you're running a POSIX-like OS and have your /var on a separate filesystem, the OS image doesn't even change very often, so it's probably up to date almost all the time - with little overhead.

To my way of thinking, the ability to test this setup easily is one of its greatest merits. When it comes to HA (and DR) - if you don't test it, it doesn't work.

And, you don't have to buy expensive technology to do it. Although Robert pointed out that you can buy DataGuard, it isn't an outstanding product. It's a solid but basic kind of HA product. Nothing flashy, nothing very interesting technically. Free products like Linux-HA are more capable, and well-equipped to deal with this kind of situation. DRBD works very well, and is used by many tens of thousands of clusters, and is well-modelled by Linux-HA - which has special modelling for data replication services like DRBD.

Whether or not you use this product or that product for the site switch or not, something like DRBD is a very nice tool to have in your tool box. You can also buy support for it from the good people at LinBit (this is how they finance their excellent work). I don't get any compensation from them - I've known them a long time and like their product a lot.

If I recall correctly I covered the basics of this kind of arrangement in an early post - but not in a lot of detail.

This advice won't be appropriate for everyone, but it is probably more likely to fit your situation than you might think. The idea of getting such good software for a low cost (or free) usually takes a while to percolate into your thinking.


Snapshots are good for development environments, for systems that don't involve databases and lots of data changes, and the like.

As always, one size does not fit all.

Andrew Stone

Speaking of "continuous backups", how about continuous snapshotting? Sure, the OP has an issue with 30-minute snapshots... but at what interval does it become acceptable? 5 minutes? 5 seconds? 5 times a second? (Snapshotting is an evolving technology.) I believe that as the interval decreases, a larger and larger set of applications would accept snapshots as HA. Eventually you get to bank transactions, which cannot lose any committed transactions. This means that your snapshot interval must be shorter than the transaction latency. Ok, that might be hard :-). Also, what kind of hardware are we able to dedicate to HA? DRBD and someday "continuous" snapshotting require a lot of network bandwidth and 1 to 1 redundancy.
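The interval question can be made concrete with a small sketch. It assumes a steady transaction rate and simply applies the criterion stated in this comment (interval shorter than transaction latency); the rate, the latency and both function names are hypothetical, chosen only for illustration.

```python
def worst_case_lost_transactions(interval_s: float, tx_per_s: float) -> float:
    """Transactions committed since the last snapshot, all of which are
    lost if the server fails just before the next snapshot is taken."""
    return interval_s * tx_per_s


def meets_latency_criterion(interval_s: float, tx_latency_s: float) -> bool:
    """The comment's criterion: the snapshot interval must be shorter
    than the transaction latency."""
    return interval_s < tx_latency_s


if __name__ == "__main__":
    TX_PER_S = 200        # assumed steady load, not a measured figure
    TX_LATENCY_S = 0.05   # assumed 50 ms per committed transaction

    # 30 minutes, 5 minutes, 5 seconds, 5 times a second
    for interval in (1800, 300, 5, 0.2):
        lost = worst_case_lost_transactions(interval, TX_PER_S)
        ok = meets_latency_criterion(interval, TX_LATENCY_S)
        print(f"interval {interval:>6} s: up to {lost:,.0f} lost, "
              f"criterion met: {ok}")
```

With these assumed numbers even five snapshots a second falls short of the criterion, which fits the "that might be hard" remark.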

Alan R.

Continuous snapshotting is clearly an interesting idea. What you really want is to mirror the processes and the data to another server simultaneously, and in an ideal world, synchronously. Unfortunately, this gets progressively more expensive if you approach it as an incremental improvement on snapshotting.

However, a year or two ago IBM bought a company, Meiosys, that has developed continuous snapshotting technology that takes a different approach to the problem - and avoids snapshotting things it doesn't have to. Of course, the interesting part is how to distinguish things you have to snapshot from things you don't.

It doesn't work at the virtual operating system level, but at a lighter-weight application container level. I was naturally skeptical (I'm a paranoid HA guy, after all), but their ideas actually make sense. I don't think they're in a product yet, and I'm not sure when (or if) they will be.

