virtualization

June 10, 2008

Watch that basket!

The computing industry has lots of trends, numerous buzzwords, and a number of hot topics.  Sometimes these are in conflict with each other, or at least start out that way...  But, in the end, there are often good ways to harmonize all these various things.

Let's wander into virtual machine territory again today.  If you have gone to the trouble to create a bunch of virtual machines, the chances are you hope to do a little server consolidation - because when that's properly done it can save you some money.

This sounds good, and indeed has lots of good things going for it.  It's buzzword compliant, it's green, it saves you green (money).  What's not to like?

To see what you might not like if this is all you do, let's take an example to make it obvious...

If you put all your virtual machines on one physical server, then if that server fails, you lose all your virtual machines.  If you put ten virtual machines on one server, then the impact of that server crashing is roughly ten times as great as the impact of losing a server that runs just one of those workloads.  If you work at it, you might be able to consolidate the ten most critical virtual machines onto a single server - and bring your entire data center to a halt with just one crash - bringing a suddenly much more personal meaning to the term "shock and awe".
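To put rough numbers on that, here is a minimal Python sketch.  The VM names, host names and placements are made up for illustration; the only point is to count how many virtual machines go down with the worst single host failure.

```python
from collections import Counter


def vms_lost_per_host_failure(placement):
    """Map each physical host to the number of VMs that go down with it."""
    return Counter(placement.values())


# Hypothetical placements: VM name -> physical host name
consolidated = {f"vm{i}": "host-a" for i in range(10)}
spread = {f"vm{i}": f"host-{i % 5}" for i in range(10)}

# Worst single-host failure under each placement
worst_consolidated = max(vms_lost_per_host_failure(consolidated).values())
worst_spread = max(vms_lost_per_host_failure(spread).values())
print(worst_consolidated, worst_spread)
```

With everything on one host, a single crash takes out all ten virtual machines; spread over five hosts, the worst crash takes out two.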

This is not typically what people are looking for in their data center - and could easily be one of those career-limiting mistakes that you'd like to avoid - unless you already have your next job lined up.

This falls under the "putting all your eggs into one basket" way of doing business.  This is part of a famous quote - but not the whole quote.  Mark Twain said "Put all your eggs in the one basket and --- WATCH THAT BASKET"[1].  So, to follow Mark Twain's advice, we need to not just put our eggs into one basket, we also need to watch that basket.

As most of you already know, watching servers and services is most commonly done by high-availability software - something like Linux-HA[2].  A properly configured HA system will watch the basket for you, and keep the worst from happening to your basket, your servers or your career.

As you can see, doing virtualization for reasons of consolidation doesn't make much sense unless you also add management software (HA software or otherwise) to watch your basket of virtual machines for you.

In the end, it's easy to see that all these things are connected - virtualization, server consolidation, power savings (green computing), availability management, and you want to manage them all.

[1] http://herbison.com/herbison/broken_eggs_watch.html
[2] http://linux-ha.org/

March 12, 2008

Virtual machine snapshots considered (nearly) worthless...

With apologies to Edsger Dijkstra...

Usually when people talk about virtual machine snapshotting, they include with it snapshotting both the server and any filesystems it's directly connected to.  Although this is more complex than just snapshotting the virtual machine, it isn't that hard.

This works in some very narrow technical sense for some few cases, but it involves loss of data in every case.  If you take a checkpoint every 30 minutes (or every 5 or whatever), then all the updates made since the last checkpoint are lost when you restore this snapshot and its storage to a consistent (but old) state.  This means that all the checks you deposited during that time, or all the bonuses your boss put you in for during that time, or the books you ordered, or whatever, are lost.  Lost to the point that they probably have to be restored manually - to the tune of great customer dissatisfaction.

In addition, if this application has connections as a client or as a server to other servers or clients, then the application and its immediately mounted storage are now consistent with each other, but not with anything else.  Unless you take simultaneous snapshots of this virtual machine and all the world it is connecting with (some of which may be outside your enterprise), and then restore your entire world to this older state, there are likely to be many client/server connections which will no longer work - because the client and server are in mutually inconsistent states.

The worst case of this is if you have a Service Oriented Architecture, where any given server is only a small part of the overall service - every service has connections to something else all the time, and to make matters worse, the clients and/or servers are often outside your own enterprise.

And, of course, don't forget that you lost transactions in the process too.  So, a reboot interval of 1 to 3 minutes sounds really good by comparison, because all you'll lose in that case is transactions that were not yet committed - which are many fewer than the number of transactions lost by backing up to the previous checkpoint.
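A back-of-the-envelope sketch of that comparison, in Python.  The transaction rate and the number of in-flight transactions are invented numbers, and the average assumes a failure is equally likely at any moment between two snapshots.

```python
def snapshot_rollback_loss(interval_min, tx_per_min):
    """Committed transactions lost when rolling back to the last snapshot.

    Returns (average, worst_case).  The average assumes the failure is
    equally likely at any moment between two snapshots.
    """
    worst_case = interval_min * tx_per_min
    return worst_case / 2, worst_case


def reboot_loss(in_flight_tx):
    """A reboot loses only the transactions that were not yet committed."""
    return in_flight_tx


# Hypothetical workload: 100 committed transactions per minute,
# 5 transactions in flight at the moment of the crash.
avg_30, worst_30 = snapshot_rollback_loss(interval_min=30, tx_per_min=100)
avg_5, worst_5 = snapshot_rollback_loss(interval_min=5, tx_per_min=100)
lost_on_reboot = reboot_loss(in_flight_tx=5)
print(avg_30, worst_30, avg_5, worst_5, lost_on_reboot)
```

Under these assumptions a 30-minute snapshot interval loses 1500 committed transactions on average and up to 3000, a 5-minute interval loses 250 on average, and a plain reboot loses only the handful that were in flight.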

As an example of a common special case where this obviously doesn't work, imagine that the server in question is a file server.  So, you restore the virtual machine and all its storage (the file server) to some older state.  Now all the connected applications which _thought_ they had committed some particular piece of work (a spreadsheet, a database transaction) - just had all that work undone.  And, depending on the file server protocol and the application, bad things will happen - certainly loss of data, and probably some of the applications will create corrupt data - since updates they thought they'd made are now gone -  unbeknownst to them.  This corrupt data can cause any number of problems - inability to make further updates, cascading application crashes - these are all possibilities.
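Here is a toy Python model of that file server case.  The FileServer class and the file names are hypothetical; it only shows how the clients' view and the server's contents diverge after a restore.

```python
import copy


class FileServer:
    """Toy file server that can be snapshotted and rolled back."""

    def __init__(self):
        self.files = {}
        self._snapshot = {}

    def write(self, path, data):
        self.files[path] = data
        return True  # the client takes this as "committed"

    def snapshot(self):
        self._snapshot = copy.deepcopy(self.files)

    def restore(self):
        self.files = copy.deepcopy(self._snapshot)


server = FileServer()
client_believes = {}  # what the connected applications think is on disk

server.write("budget.xls", "v1")
client_believes["budget.xls"] = "v1"
server.snapshot()

# Work committed after the snapshot was taken
server.write("budget.xls", "v2")
client_believes["budget.xls"] = "v2"
server.write("orders.db", "order 17")
client_believes["orders.db"] = "order 17"

server.restore()  # roll the file server back to the old snapshot

# Files where the clients' belief no longer matches the server
stale = sorted(p for p, d in client_believes.items() if server.files.get(p) != d)
print(stale)
```

Both files end up in the stale list: one silently reverted to an older version, and one vanished entirely, while the clients still believe both updates were committed.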

Or what if it's a client of a file server?  The file server is a separate machine (possibly virtual, possibly real, possibly an appliance).  Then you can't put its storage state back to a known state - without restoring all its clients back to the same consistent state - and if you somehow did, then _all_ of them now suffer data loss.

Not a very pretty picture.

There are some few cases where you can isolate the application from the "real world" and snapshot the whole "mini-enterprise" in a synchronous way.  Those are mostly limited to large scale scientific applications.  Given how hard it is to make them more available in any other way, this is a good thing.  But, it's a practice with narrow applicability.  After reading the paragraphs above, perhaps you can see why...

December 13, 2007

How Managed Virtualization (including HA) conflicts with System Management

Managed Virtualization Versus System Management

In an earlier post[1], I talked about a couple of kinds of virtualization, comparing two of them and highlighting their strengths.  This posting discusses how virtualization can confuse and confound conventional systems management - both automated and manual, and gives some thoughts on how to deal with it.

We all know that virtualization is a GoodThing(TM).  Therefore, it can't really have any disadvantages, can it?  <tongue-in-cheek-off> Unfortunately, it does have disadvantages.  The great strength of virtualization is its ability to break the ties between a service or operating system and the server which implements its service.  Many software systems and a good number of human beings find this confusing.  If I want to reboot a physical server, what services or operating systems will be disrupted by the reboot?

Conversely, if I want to do something to the machine that's running a particular service, which machine do I have to log into?  If you're running service virtualization (conventional HA like Linux-HA[2]) on top of server virtualization (à la Xen or VMware), then you have a doubly difficult task - first you have to figure out which virtual machine is running a service, then you have to figure out which physical machine is running that particular virtual machine.
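The two questions amount to a two-level lookup, which a few lines of Python can sketch.  The service, virtual machine and host names are invented, and in real life both tables keep changing underneath you, which is exactly the problem.

```python
# Hypothetical current state of the data center
service_to_vm = {"web": "vm1", "mail": "vm1", "db": "vm2"}
vm_to_host = {"vm1": "hostA", "vm2": "hostB"}


def host_for_service(service):
    """Which physical machine is (indirectly) running this service?"""
    return vm_to_host[service_to_vm[service]]


def services_disrupted_by_reboot(host):
    """Which services go down if this physical machine is rebooted?"""
    vms = {vm for vm, h in vm_to_host.items() if h == host}
    return sorted(s for s, vm in service_to_vm.items() if vm in vms)


print(host_for_service("db"))
print(services_disrupted_by_reboot("hostA"))
```

Every migration or failover rewrites one of the two tables, so an answer that was right an hour ago may be wrong now.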

This can be really annoying and can easily result in system administrators[3] making mistakes either in the middle of the night, or when under pressure (which all sysadmins know is pretty much all the time).

Remember - Complexity is the Enemy of Reliability.   This is just another example of my favorite phrase at work.

And, if you want to have server monitoring software which tries to figure out whether a service is stopped and have it restart it, then it can also get confused by the fact that all these stupid servers and services are always moving around.  They just won't stay put!  Back in the olden days, you logged into a server and you edited the inittab, and you always knew what hardware it was running on and what server it was.  Now, with virtualization, and especially with virtualization management software, you never know what's where.

A Recipe for Chaos and Conflict

Your HA software and/or your virtualization management software can move things around on you.  Imagine that you have these four kinds of things in your data center:

  • High-Availability (HA/service-virtualization) management software

  • Virtualization management software

  • System management monitoring software

  • Human system administrators

This is a recipe for chaos, interspersed with the occasional career-limiting disaster. It's this kind of thing that leads system administrators to pull their hair out, and keep their resumes up to date.  None of these is bad by itself, in fact, each is a GoodThing(TM).  But they don't normally play well with each other. In typical myopic software design fashion, each of these layers is usually unaware of the other layers (except, of course for the last (human) layer - who has to make up for all the poor integration).

In addition, since the software layers typically aren't aware of all this wonderful virtualization going on, they can't really deal with the picture reliably.  They don't know what should be happening where, because it isn't fixed.  The various virtualization management packages keep changing things!

So, what's a body to do?  As far as I know, there are two basic options.

  1. Integrate the four layers of management with each other using things like CIM[4] and SNMP[5]

  2. Empower your HA software to also manage the server virtualization of your data center

Integration of Layers

Virtually every data center (sadly, pun intended) has a variety of server types and a variety of operating systems, and a variety of management software.  They mostly don't play well with each other.  Almost the only way to get them to play together - even if imperfectly - is to have them talk together using industry standard protocols.

Today, that means using SNMP or CIM.   Here is my personal view on the characteristics of these two protocols for your consideration.

  • SNMP - widely deployed - implemented in a truly compatible way, but far too weak for a job this hard.  SNMP is great for grabbing statistics, checking whether a server or router is up and what kind of load it is seeing in great detail.  Anything much beyond this, and the MIBs become 100% vendor-specific - meaning that cross-vendor integration breaks down - basically completely.  For HA clustering or virtualization management or worse yet the combination of the two - forget it.

  • CIM - widely deployed in expensive disk subsystems - but rarely deployed outside that.  It has newly developed models for virtualization and clustering, but like most standards they're mostly lowest-common-denominator standards, and unfortunately not widely deployed.  For example, Linux-HA[2] implements CIM, but unfortunately Linux-HA has tremendous power and capability which CIM can't begin to model.  So, this winds up being only possible to model using vendor-specific extensions - greatly weakening the possible integrations.

Now, I'm not saying that these two protocols are useless - far from it. Without open standards like CIM and SNMP, the prospect truly is hopeless.   But I am saying that integrating them in the typical-for-the-industry highly-heterogeneous data center is a challenge, and the more layers there are to integrate, the bigger the challenge.  Since standards necessarily trail industry practice, the more "bleeding edge" the topic (i.e., HA clustering or virtualization) and the more powerful the underlying tool (like Linux-HA), the greater the mismatch.

Here we have two bleeding edge topics and four layers.  Yikes!  Surely there must be some kind of alternative to this somewhat-unattractive mess.

Decrease The Layers and Let Them Manage Themselves

As I mentioned in my earlier virtualization posting, some HA packages (like Linux-HA) can also manage virtualization simultaneously.  So, one way of dealing with this is to let (or extend) your service virtualization product also manage your server virtualization.  One advantage of this approach is that service virtualization software (HA software) is comparatively mature technology, minimizing the risk.

Unfortunately, this doesn't yet go all the way in solving the problem either.  There are a few things that should change to make this really work well. These include

  • Support much larger HA clusters - hundreds to thousands of nodes.  In an ideal world, you'd really like as few of these HA/virtualization clusters as you can get.  Today you'd typically have to have one of these clusters for every 8-32 physical servers - which makes an awful lot of these things to manage in a data center containing hundreds or thousands of servers.

  • Integrate with many virtualization layers - Such a product would need to integrate with Xen, IBM System Z, IBM System P, Linux KVM, VMware, and future virtualization layers like the one promised by Microsoft.   This isn't rocket science, but by the time you're done, it will be some work.

  • Support monitoring and controlling services inside the virtual machine - Otherwise you haven't really integrated the two layers - and you wind up running some HA software inside some of the virtual machines.  Again, this isn't rocket science, but it will require some work[1] for each operating system you want to manage services for.

  • Integrate with provisioning systems - so that you can add and delete virtual machines and allocate disk to them and their applications with fewer possibilities for error, and more automation.
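To see why the cluster-size item in the list above matters, here is the arithmetic as a small Python sketch.  The 1000-server data center is a made-up example; the 8 and 32 node limits are the range quoted above.

```python
import math


def clusters_needed(physical_servers, max_nodes_per_cluster):
    """Minimum number of HA clusters needed to cover every physical server."""
    return math.ceil(physical_servers / max_nodes_per_cluster)


# Hypothetical data center with 1000 physical servers
for limit in (8, 32, 1000):
    print(limit, clusters_needed(1000, limit))
```

At 8 nodes per cluster that is 125 clusters to manage, at 32 nodes it is still 32, and only when a cluster can span the whole data center does it drop to one.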

None of these items are technically difficult, and none of them are prohibitively expensive to implement.  Given that I'm the project leader for Linux-HA, and Linux-HA is one of the most capable HA products around, you might imagine that some of these thoughts are on my mind for our future  ;-).  Of course, that doesn't eliminate the necessity for integration with the remaining layers above, which is why Linux-HA implements both CIM and SNMP.  This allows the virtualization management infrastructure to actively and autonomically manage  servers and services, while letting it bubble up events (especially those it can't automatically recover from) to the management consoles and humans via protocols like SNMP and/or CIM.

Conclusions

Virtualization technologies add complexity to the data center along with the benefits they bring, and in the process may render the existing management facilities less than useful.  However, if HA and virtualization management are performed by a single entity, and open standards like CIM and SNMP are used, systems can be actively managed and the problems can be minimized.

See Also

Preparing for Virtual Management http://www.itbusinessedge.com/blogs/dcc/?p=276

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/virtualization-.html
[2] http://linux-ha.org/
[3] http://linux-ha.org/SysAdmin
[4] http://www.dmtf.org/standards/cim/
[4] http://en.wikipedia.org/wiki/Common_Information_Model_%28computing%29
[5] http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

September 28, 2007

Virtualization as High Availability (Disaster Recovery) or High-Availability as Virtualization?

Many pundits, along with other folks like VMware's CEO Diane Greene, have touted[1] virtualization as being the "cure" for disaster recovery, some of them for the past several years.  Disaster recovery can be pretty reasonably viewed as being high-availability over distance, so it makes some sense to see how DR, HA and virtualization fit together.  What's hype here, and what's real?  Let's look and see what we find out.

What is virtualization?

Wikipedia[2] defines virtualization like this:

In computing, virtualization is a broad term that refers to the abstraction of computer resources. One useful definition is "a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications or end users interact with those resources".

At the moment, this term is most commonly used to refer to machine or OS-level virtualization, storage virtualization and something I'll call container virtualization.  For now, let's ignore storage virtualization.  Machine-level virtualization means basically making it difficult to tell exactly which physical machine an OS is running on at any given point in time.  There are many other kinds of virtualization as well - for example service virtualization.  Service virtualization doesn't hide or abstract physical machines, but instead virtualizes or hides services.  That is, there is no fixed binding between services and physical machines.  This is, in fact, exactly what classic HA software does - software like Linux-HA[3].  An HA system uses service virtualization to recover from server failures, and monitors services so that it can restart them - either locally or on another server.
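The "restart locally or on another server" decision can be sketched in a few lines of Python.  This is a toy illustration, not how Linux-HA or any other product actually decides; the function name, the node names and the node_is_up callback are all hypothetical.

```python
def recover(current_node, all_nodes, node_is_up):
    """Decide how to recover a failed service.

    Returns ("restart", node) if the node it was running on is still up,
    ("failover", node) if another node has to take over, or
    ("give_up", None) if no node is available.
    """
    if node_is_up(current_node):
        return ("restart", current_node)
    for node in all_nodes:
        if node != current_node and node_is_up(node):
            return ("failover", node)
    return ("give_up", None)


nodes = ["node1", "node2", "node3"]
up = {"node1": False, "node2": True, "node3": True}

# The service died but its node (node2) is fine: restart in place.
print(recover("node2", nodes, up.get))
# The whole node (node1) is down: move the service elsewhere.
print(recover("node1", nodes, up.get))
```

Because nothing ties the service to one machine, either answer is acceptable to the customer, who only cares that the service comes back.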

Customers care about services, not about OSes or physical machines, so from their point of view, machine and service virtualization are similar.  Both  provide the potential for recovering from failures of physical servers and OSes. However, since machine virtualization doesn't  have any visibility of the individual services and the dependencies between them, it can't monitor them and restart them if they fail.

But, the ideas of virtualization need management entities to oversee them and decide to move virtual entities, and coordinate the movement of these virtual entities from one place to another.    VMware will soon have their Site Recovery Manager[4], and Linux has Linux-HA[3] (among other solutions).  The purpose of such management entities is to detect and respond to failures and administrative requests by starting, stopping, and migrating virtual entities in their purview to different sets of physical servers.

What does server (or OS-level) virtualization bring to the table here?

Why would I care?  If HA systems produce a result which sounds so far like it would be similar or even superior from the customer's perspective to what OS level virtualization does (at least from an availability perspective), then why should I care?  The answer is not found in responses to failures, but in responses to administrative events.  If you want to migrate a running virtual machine from physical machine A to physical machine B, many server virtualization techniques (including Xen[10] and VMware) provide a new tool - transparent migration.  In transparent migration, the virtual machine image is migrated from physical machine A to physical machine B while it continues processing work without appearing to stop.  This is a pretty cool trick, since you can then migrate virtual machines with apparently zero downtime.  Sadly, you can't migrate a crashed OS or an OS from a crashed server in this way.  As a result, it doesn't help at all with unplanned outages.  However, it makes it possible to take a planned outage with very minimal impact.  In a practical sense, if your server is running full-throttle and writing on all the pages in memory very rapidly, the migration will likely impact performance.  So, it's best not to migrate virtual machines during peak load without good reasons.


Can conventional HA software manage transparent migration?

The simple answer is "mostly no".  Most HA software has only three basic operations that it performs on its resources:  Start, Stop and Monitor.  If you map these operations to operate on virtual machines as resources, start means boot the OS, stop means shut it down, and monitor means look and see if it appears to be running correctly.  Note that each of these also operates on only one machine at a time.  A migration like the one described above becomes stop the virtual machine on machine A, and then start it on machine B.  This is hardly transparent, since the stop and start operations create an outage which will range in time from two minutes to 30 or more minutes and completely lose all transient application state.

The problem is in the HA software's model of resources (or services).  The operations which it performs are like this OP(resource, node) for any given OP like start, stop or monitor.  What's needed instead is an operation model which permits OP(resource, fromnode, tonode).  In addition, when one adds dependencies, then the migratable resource can't depend on any non-migratable resource.  [It's actually a little more complicated than this, but it would take too much time to explain all the details in this post.]
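The difference between the two operation models can be shown with a toy Python class.  This is a sketch of the idea only, not the API of Linux-HA or any other HA package; the class, its attributes and the "sessions" state are invented for illustration.

```python
class Cluster:
    """Toy model of an HA resource manager; not any real product's API."""

    def __init__(self):
        self.running_on = {}  # resource -> node it is running on
        self.transient = {}   # resource -> in-memory state (lost on stop)

    # Classic single-node operations: OP(resource, node)
    def start(self, resource, node):
        self.running_on[resource] = node
        self.transient[resource] = {}  # a fresh boot starts with no state

    def stop(self, resource, node):
        if self.running_on.get(resource) == node:
            del self.running_on[resource]
            del self.transient[resource]  # transient state is gone

    def monitor(self, resource, node):
        return self.running_on.get(resource) == node

    # Two-node operation: OP(resource, fromnode, tonode)
    def migrate(self, resource, from_node, to_node):
        if not self.monitor(resource, from_node):
            raise RuntimeError("cannot migrate: not running on source node")
        self.running_on[resource] = to_node  # transient state travels along


c = Cluster()
c.start("vm1", "A")
c.transient["vm1"]["sessions"] = 42

# "Migration" built from single-node operations: stop on A, start on B
c.stop("vm1", "A")
c.start("vm1", "B")
after_restart = dict(c.transient["vm1"])

# True two-node migration back from B to A
c.transient["vm1"]["sessions"] = 42
c.migrate("vm1", "B", "A")
after_migrate = dict(c.transient["vm1"])
print(after_restart, after_migrate)
```

The stop-then-start version comes up on the other node with empty state, while the two-node operation keeps it.  The migrate operation also refuses to run when the resource isn't running on the source node, which mirrors the point that a crashed machine can't be transparently migrated.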

Are there any HA packages that can model transparent migration?

Yes.  Sorry to sound a bit like a broken record[13], but Linux-HA[3] can do that very nicely - and to my knowledge it is the only HA system which can.  It includes support for migratable resources like virtual machines and container virtualization, and handles the complexities of dependencies alluded to above, along with other potential problems not alluded to above.  So, if you use Linux-HA to manage your virtual machines, you get full use of transparent migration without having to figure out some monumental kludge to shoehorn something kind of like transparent migration into your HA package.  It just works.  And, it just works for containers, and resources which support checkpoint restart.  All in all, a nice feature I'd say.

What does transparent migration have to do with disaster recovery?

Nothing.  Well... Almost nothing.  Let me explain...  Since transparent migration only helps move services when the machines are up, that's not normally very useful in a disaster - since presumably a fire or flood or power failure, or building collapse or whatever has rather rudely and typically unexpectedly interrupted service - precluding migration. Why did I say almost nothing above?  Because some kinds of disasters (hurricanes, and floods for example) often give warning in advance - so you can use transparent migration for those cases.  The other reason is that you can test your disaster recovery without any service outages.  Or rather, you can partially test your disaster recovery.  If there are any resources (disk drives for example) which are only needed during boot, then if those are missing, you probably won't notice when you migrate over and back. So, it is still desirable to periodically do non-transparent migrations of your services to make sure things are all still working after the latest configuration changes.  However, you can migrate over and back every weekend, and maybe once a month or once a quarter perform non-transparent migrations just to make sure all your bases are covered.

Can't you do this with an HA cluster split across sites?

Yes.  You can do it with only service virtualization and a conventional HA system, or you can do it with Linux-HA[3] and either service or server virtualization.  In either case, you have to prepare for the replication of data[5] across sites.  Note that making sure all the data can be accessed from multiple physical machines is part of the baseline discipline which is imposed both by virtual machine failover and by service failover.  This discipline is a starting place for any kind of disaster recovery, and it is a valuable one to undertake.  So, although this discipline of replicating data and OSes is a good thing, it's not unique to virtual machines, so this part is more hype than reality.  There's nothing special about virtual machines in this respect.  Any differences that exist don't exist because of virtualization per se, but because of how easy the management tools are to use correctly, and how comprehensive they are.

Advantages of service virtualization

  • Monitors individual services [6] [7] - can detect and recover from software errors that don't cause OS crashes

  • Can recover individual services (more fine-grained)

  • Can recover servers and services without time for a reboot  (faster)

Advantages of Server virtualization

  • Transparent migration for administrative events is outage-free

  • All virtual machines more or less look alike (simplicity)

Best of Both Worlds?

Linux-HA[3] can model virtual machines as resources, and it can model normal services (web server, etc) as resources.  Can it do both at the same time?  Well... Yes and No...  Doing this would require that you have both virtual machines as resources and as cluster members at the same time.  Although you could do this, it might have unpleasant side effects.  However, you could have a cluster of virtual machines and a cluster of real machines - and that would work - but would be two clusters to manage.  To truly get the best of both worlds, you'd have to implement our container resource proposal[8] or something like it.  Of course, since Linux-HA is an open source project, we'd love to see a patch to implement it ;-).

See Also

Virtualization Blog[9]
Dabcc virtualization and server management web site[12]

References

[1]  http://www.byteandswitch.com/document.asp?doc_id=133643
[2] http://en.wikipedia.org/wiki/Virtualization
[3] http://linux-ha.org/
[4] http://www.vmware.com/company/news/releases/srm.html
[5] http://techthoughts.typepad.com/managing_computers/2007/09/automated-disas.html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[7] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html
[8] http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1417
[9] http://www.virtualization.info/
[10] http://www.xensource.com/
[11] http://www.vmware.com/
[12] http://www.dabcc.com/
[13] http://en.wikipedia.org/wiki/Gramophone_record