Managed Virtualization Versus System Management
In an earlier post[1], I talked about a couple of kinds of virtualization, comparing two of them and highlighting their strengths. This posting discusses how virtualization can confuse and confound conventional systems management - both automated and manual - and gives some thoughts on how to deal with it.
We all know that virtualization is a GoodThing(TM). Therefore, it can't really have any disadvantages, can it? <tongue-in-cheek-off> Unfortunately, it does have disadvantages. The great strength of virtualization is its ability to break the ties between a service or operating system and the physical server that provides it. Many software systems and a good number of human beings find this confusing. If I want to reboot a physical server, what services or operating systems will be disrupted by the reboot?
Conversely, if I want to do something to the machine that's running a particular service, which machine do I have to log into? If you're running service virtualization (conventional HA like Linux-HA[2]) on top of server virtualization (a la Xen or VMware), then you have a doubly difficult task: first you have to figure out which virtual machine is running the service, then you have to figure out which physical machine is running that particular virtual machine.
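To make that two-step lookup concrete, here is a rough sketch of what you (or your scripts) end up doing by hand today. It assumes a crm_resource-style "locate" query from the cluster manager and Xen's xm list on each candidate dom0, reached over ssh; command names, options and output formats vary by release, so treat the details as illustrative rather than definitive.

    #!/usr/bin/env python
    # Sketch: locate the hardware behind a doubly-virtualized service.
    # Assumes a crm_resource-style locate command and `xm list` over ssh;
    # the commands and the output parsing are illustrative, not definitive.
    import subprocess

    def vm_running_service(resource):
        """Step 1: ask the HA layer which (virtual) node hosts the resource."""
        out = subprocess.check_output(
            ["crm_resource", "--locate", "--resource", resource], text=True)
        # Typical output ends with the node name, e.g. "... is running on: vm-3"
        return out.strip().rsplit(" ", 1)[-1]

    def host_running_vm(vm_name, dom0s):
        """Step 2: poll each candidate dom0 to see which one hosts that VM."""
        for host in dom0s:
            listing = subprocess.check_output(["ssh", host, "xm", "list"], text=True)
            names = [line.split()[0] for line in listing.splitlines()[1:] if line.strip()]
            if vm_name in names:
                return host
        return None

    if __name__ == "__main__":
        vm = vm_running_service("webserver")            # service -> virtual machine
        hw = host_running_vm(vm, ["dom0-a", "dom0-b"])  # virtual machine -> hardware
        print("service 'webserver' runs in VM %s on physical host %s" % (vm, hw))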
This can be really annoying and can easily result in system administrators[3] making mistakes either in the middle of the night, or when under pressure (which all sysadmins know is pretty much all the time).
Remember - Complexity is the Enemy of Reliability. This is just another example of my favorite phrase at work.
And if you want server monitoring software that tries to figure out whether a service has stopped and restart it, it too can get confused by the fact that all these stupid servers and services are always moving around. They just won't stay put! Back in the olden days, you logged into a server and edited the inittab, and you always knew what hardware it was running on and what server it was. Now, with virtualization, and especially with virtualization management software, you never know what's where.
A Recipe for Chaos and Conflict
Your HA software and/or your virtualization management software can move things around on you. Imagine that you have these four kinds of things in your data center:
High-Availability (HA/service-virtualization) management software
Virtualization management software
System management monitoring software
Human system administrators
This is a recipe for chaos, interspersed with the occasional career-limiting disaster. It's this kind of thing that leads system administrators to pull their hair out and keep their resumes up to date. None of these is bad by itself; in fact, each is a GoodThing(TM). But they don't normally play well with each other. In typical myopic software design fashion, each of these layers is usually unaware of the other layers (except, of course, for the last (human) layer, which has to make up for all the poor integration).
In addition, since the software layers typically aren't aware of all this wonderful virtualization going on, they can't really deal with the picture reliably. They don't know what should be happening where, because it isn't fixed. The various virtualization management packages keep changing things!
So, what's a body to do? As far as I know, there are two basic options.
Integrate the four layers of management with each other using things like CIM[4] and SNMP[5]
Empower your HA software to also manage the server virtualization of your data center
Integration of Layers
Virtually every data center (sadly, pun intended) has a variety of server types, a variety of operating systems, and a variety of management software. They mostly don't play well with each other. Almost the only way to get them to play together - even if imperfectly - is to have them talk to each other using industry-standard protocols.
Today, that means using SNMP or CIM. Here is my personal view on the characteristics of these two protocols for your consideration, followed by a small sketch of what each looks like in practice.
SNMP - widely deployed and implemented in a truly compatible way, but far too weak for a job this hard. SNMP is great for grabbing statistics, and for checking in great detail whether a server or router is up and what kind of load it is seeing. Anything much beyond this, and the MIBs become 100% vendor-specific - meaning that cross-vendor integration breaks down basically completely. For HA clustering or virtualization management, or worse yet the combination of the two - forget it.
CIM - widely deployed in expensive disk subsystems, but rarely deployed outside that. It has newly developed models for virtualization and clustering, but like most standards they're lowest-common-denominator models, and unfortunately not widely deployed. For example, Linux-HA[2] implements CIM, but Linux-HA has tremendous power and capability which CIM can't begin to model. So much of it can only be modeled using vendor-specific extensions - which greatly weakens the possible integrations.
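Here is the promised sketch - the same "is this system there and what is it doing?" question asked both ways. It assumes the net-snmp snmpget command-line tool, the pywbem client library, and a CIMOM listening on the managed node; the host name, community string and credentials are placeholders.

    # Sketch: one simple question asked over SNMP and over CIM/WBEM.
    # Host name, community string and credentials are placeholders.
    import subprocess
    import pywbem

    HOST = "server1.example.com"

    # --- SNMP: fine for simple, well-standardized statistics ----------------
    def snmp_get(oid, community="public"):
        """Fetch one OID value via net-snmp's snmpget (-Oqv prints just the value)."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", HOST, oid], text=True)
        return out.strip()

    print("uptime:    ", snmp_get("SNMPv2-MIB::sysUpTime.0"))
    print("1-min load:", snmp_get("UCD-SNMP-MIB::laLoad.1"))

    # --- CIM/WBEM: richer object model, needs a CIMOM on the managed node ---
    conn = pywbem.WBEMConnection("https://" + HOST, ("monitor", "secret"),
                                 default_namespace="root/cimv2")
    for system in conn.EnumerateInstances("CIM_ComputerSystem"):
        # 'Name' is a key property in the DMTF core schema; anything much richer
        # quickly drops into vendor-specific extension classes.
        print(system.classname, system["Name"])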
Now, I'm not saying that these two protocols are useless - far from it. Without open standards like CIM and SNMP, the prospect truly is hopeless. But I am saying that integrating them in the typical-for-the-industry highly-heterogeneous data center is a challenge, and the more layers there are to integrate, the bigger the challenge. Since standards necessarily trail industry practice, the more "bleeding edge" the topic (e.g., HA clustering or virtualization) and the more powerful the underlying tool (like Linux-HA), the greater the mismatch.
Here we have two bleeding edge topics and four layers. Yikes! Surely there must be some kind of alternative to this somewhat-unattractive mess.
Decrease The Layers and Let Them Manage Themselves
As I mentioned in my earlier virtualization posting, some HA packages (like Linux-HA) can also manage virtualization simultaneously. So, one way of dealing with this is to let (or extend) your service virtualization product also manage your server virtualization. One advantage of this approach is that service virtualization software (HA software) is comparatively mature technology, minimizing the risk.
Unfortunately, this doesn't yet go all the way in solving the problem either. There are a few things that would need to change to make this really work well. These include:
Support much larger HA clusters - hundreds to thousands of nodes. In an ideal world, you'd really like as few of these HA/virtualization clusters as you can get. Today you'd typically have to have one of these clusters for every 8-32 physical servers - which makes for an awful lot of these things to manage in a data center containing hundreds or thousands of servers.
Integrate with many virtualization layers - Such a product would need to integrate with Xen, IBM System Z, IBM System P, Linux KVM, VMware, and future virtualization layers like the one promised by Microsoft (see the sketch after this list). This isn't rocket science, but by the time you're done, it will be some work.
Support monitoring and controlling services inside the virtual machine - Otherwise you haven't really integrated the two layers - and you wind up running some HA software inside some of the virtual machines. Again, this isn't rocket science, but it will require some work[1] for each operating system you want to manage services for.
Integrate with provisioning systems - so that you can add and delete virtual machines and allocate disk to them and their applications with fewer possibilities for error, and more automation.
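As a sketch of the "integrate with many virtualization layers" item above: an abstraction layer such as libvirt already hides Xen and KVM/QEMU (and, to a degree, other hypervisors) behind one API, so an HA/virtualization manager could drive several of them through the same calls. The connection URIs are illustrative and the calls follow current libvirt-python; take it as a shape for the integration rather than a finished design.

    # Sketch: driving more than one hypervisor through a single API (libvirt).
    # Connection URIs are illustrative; error handling is omitted for brevity.
    import libvirt

    for uri in ("xen:///", "qemu:///system"):        # one connection per hypervisor
        conn = libvirt.open(uri)
        try:
            for dom in conn.listAllDomains():        # active and defined-but-inactive VMs
                state, _reason = dom.state()
                if state != libvirt.VIR_DOMAIN_RUNNING:
                    print("restarting", dom.name(), "on", uri)
                    dom.create()                     # boot an inactive VM; deciding which
                                                     # VMs *should* run is the HA layer's job
        finally:
            conn.close()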
None of these items are technically difficult, and none of them are prohibitively expensive to implement. Given that I'm the project leader for Linux-HA, and Linux-HA is one of the most capable HA products around, you might imagine that some of these thoughts are on my mind for our future ;-). Of course, that doesn't eliminate the necessity for integration with the remaining layers above, which is why Linux-HA implements both CIM and SNMP. This allows the virtualization management infrastructure to actively and autonomically manage servers and services, while letting it bubble up events (especially those it can't automatically recover from) to the management consoles and humans via protocols like SNMP and/or CIM.
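For the "bubble up events" direction, the simplest possible sketch is a trap fired at the management console once the HA/virtualization layer has already handled the failure locally. It assumes the net-snmp snmptrap tool; the OIDs and console name below are placeholders, not a published Linux-HA MIB.

    # Sketch: report an already-handled event upward via an SNMP trap.
    # Assumes the net-snmp `snmptrap` CLI; OIDs and console name are placeholders.
    import subprocess

    def notify_console(console, message,
                       trap_oid="1.3.6.1.4.1.99999.0.1",   # placeholder notification OID
                       text_oid="1.3.6.1.4.1.99999.1.1"):  # placeholder varbind OID
        subprocess.check_call([
            "snmptrap", "-v2c", "-c", "public", console,
            "",                        # empty uptime field: let snmptrap fill it in
            trap_oid,
            text_oid, "s", message,    # one string varbind carrying the event text
        ])

    notify_console("nms.example.com",
                   "resource 'webserver' recovered by restart on node vm-3")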
Conclusions
Virtualization technologies add complexity to the data center along with the benefits they bring, and in the process may render the existing management facilities less than useful. However, if HA and virtualization management are performed by a single entity, and open standards like CIM and SNMP are used to tie it to the remaining layers, systems can be actively managed and these problems can be minimized.
See Also
Preparing for Virtual Management http://www.itbusinessedge.com/blogs/dcc/?p=276
References
[1] http://techthoughts.typepad.com/managing_computers/2007/09/virtualization-.html
[2] http://linux-ha.org/
[3] http://linux-ha.org/SysAdmin
[4] http://www.dmtf.org/standards/cim/ and http://en.wikipedia.org/wiki/Common_Information_Model_%28computing%29
[5] http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol
Hi Alan,
many thanks for a nice article pointing out the troubles and the
need to manage the virtualization hype :)
From what I read and understood, the answer would be a pluggable
data-center management platform which is aware of monitoring,
HA and virtualization and additionally provides lots of transparency
and automation to guide system administrators and avoid human errors.
If your answer is "yes" then I would like to recommend taking a look
at the openQRM project -> http://www.openqrm.org
Many of your listed requirements have already been implemented in openQRM, e.g.
1) support for large HA clusters
openQRM provides an integrated resource-planning and provisioning system
which enables system administrators to create "clusters on the fly".
HA is available in 3 layers:
a) automatic hardware fail-over
b) application fail-over triggered by external or internal events
c) HA for the openQRM server itself via one or more hot-standbys
2) integrate with many virtualization layers
openQRM provides a generic "partition engine" which unifies
different virtualization technologies within a single management console.
Currently openQRM supports VMware, Xen, QEMU (+KVM) and Linux-VServer
(but is not limited to those because of its open pluggable architecture).
In openQRM, virtual machines are deployed and managed in the same way
as physical machines. Migrating from physical to virtual is easy
and completely transparent. That means a system administrator may decide
to run a server today on a physical system, tomorrow on a Xen partition,
the day after tomorrow within a VMware partition, and the next day migrate it
back to a physical system without any changes to the server (-image) itself
and without hassling with the different configuration and disk-image files
of the different virtualization technologies.
3) Support monitoring and controlling services inside the virtual machine
Again, in openQRM virtual machines are managed and monitored in the same
way as physical systems.
4) Integrate with provisioning systems
One of openQRM's base features is its provisioning system. It provides a full
overview of all available resources in a data center and how they are used.
New systems are detected automatically and appear in the "available/idle" pool.
New virtual systems of different types (VMware, Xen, QEMU, Linux-VServer) can be
created (and removed) easily through a unified user interface.
openQRM also integrates with storage infrastructure (e.g. NetApp, LVM2) to
automate creating/cloning servers (server-images) via snapshotting, for rapid deployment
and/or backup/restore.
many thanks again + enjoy,
Matt R.
Posted by: Matt R. | 18 December 2007 at 04:45
Just wanted to mention Matt's participation in the Profoss event on virtualisation, where there will also be a presentation titled "HA clustering made simple with OpenVZ" and one on management tools that could add some elements to the discussion.
All info at http://www.profoss.eu/events/january-2008-virtualisation/?tab=activities
Raph
Posted by: Raphael Bauduin | 19 December 2007 at 11:46