If we let A represent availability, then the simplest formula for availability is:
A = Uptime/(Uptime + Downtime)
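For example, a service that was up for 999 hours and down for 1 hour over some period has A = 999/(999 + 1) = 0.999, or 99.9% availability.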
Of course, it's more interesting when you start looking at the things that influence uptime and downtime. The most common measures that can be used in this way are MTBF and MTTR.
MTBF is Mean Time Between Failures
MTTR is Mean Time To Repair
A = MTBF / (MTBF+MTTR)
One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
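For example, a server with an MTBF of 1000 hours and an MTTR of 1 hour has A = 1000/(1000 + 1) ≈ 0.999 (three nines). Cut the MTTR to 0.01 hours (36 seconds) and A = 1000/1000.01 ≈ 0.99999 - two more nines, from repair speed alone.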
That's exactly what HA clustering tries to do. It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as possible. Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable at all by a client of the service. If it's not observable by the client, then in some sense it didn't happen at all. This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.
It's important to realize that any given data center or cluster provides many services, and not all of them are related to each other. Failure of one component in the system may not cause failure of the system. Indeed, good HA design eliminates single points of failure by introducing redundancy. If you're going to try to calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.
MTBFx is Mean Time Between Failures for entity x
MTTRx is Mean Time To Repair for entity x
Ax is the Availability of entity x
Ax = MTBFx / (MTBFx+MTTRx)
In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely. So, why did I spend your time talking about it? That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.
Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.
Let's say we have a service which runs on a single machine, and you put it onto a cluster composed of two computers, each with a certain individual MTBF (Mi), where failing over to the other computer ("repairing" the service) takes a certain repair time (Ri). With two computers, the cluster will suffer failures twice as often as a single computer, so the system MTBF becomes Mi/2. If you compute the availability of the cluster, it then becomes:
A = (Mi/2) / (Mi/2 + Ri)
Using this (incorrect) analysis for a 1000-node cluster performing the same service, the system MTBF becomes Mi/1000.
A = (Mi/1000) / (Mi/1000 + Ri)
If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.
A = 0/(0+Ri) = 0/Ri = 0
This makes it appear that adding cluster nodes decreases availability. Is this really true? Of course not! The mistake here is thinking that the service needed all those cluster nodes to make it go. If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct. But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question. Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
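To make this concrete, here's a minimal sketch in Python - with made-up values for Mi and Ri, and assuming independent failures - contrasting the incorrect "every node is essential" model above with a simple redundancy model:

```python
def availability(mtbf, mttr):
    """Basic availability formula: A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

# Hypothetical numbers, for illustration only: each node fails about
# once a year (Mi = 8760 hours), and failover takes 6 minutes (Ri = 0.1 hours).
Mi, Ri = 8760.0, 0.1

# The incorrect "every node is essential" model: n nodes fail n times
# as often as one node, so the system MTBF is Mi/n.
for n in (1, 2, 1000):
    print(n, availability(Mi / n, Ri))
# 1    -> ~0.999989
# 2    -> ~0.999977
# 1000 -> ~0.9887   (availability appears to fall as nodes are added)

# If the extra nodes are redundant replicas instead - the service is up
# as long as at least one node is up - adding nodes raises availability,
# because the service is down only when all n nodes are down at once:
Ai = availability(Mi, Ri)
for n in (1, 2, 3):
    print(n, 1 - (1 - Ai) ** n)
# 1 -> ~0.999989, 2 -> ~0.9999999999, 3 -> closer still to 1
```

The particular numbers don't matter; what matters is that the availability of the service depends on which nodes the service actually needs.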
To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails. Here are a few rules of thumb for thinking about availability:
- Complexity is the enemy of reliability (MTBF). This can take many forms:
  - Complex software fails more often than simple software
  - Complex hardware fails more often than simple hardware
  - Software dependencies usually mean that if any component fails, the whole service fails (see the sketch after this list)
  - Configuration complexity lowers the chances of the configuration being correct
  - Complexity drastically increases the possibility of human error
- What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
- Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR. Replication is another word for redundancy.
- Good failure detection is vital - HA and other autonomic software can only recover from failures it detects. Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR. They can be worse than human-speed because the humans are surprised that the failure wasn't automatically recovered, and they respond more slowly than normal. In addition, the added complexity of correcting an autonomic service while trying to keep their fingers out of the gears may slow down their thought processes.
- Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability. These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software. More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.
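To illustrate the dependency rule above, here's a small sketch with hypothetical component availabilities, again assuming independent failures: when a service has hard dependencies on all of its components, its availability is the product of their individual availabilities, so every added essential dependency can only lower it.

```python
# Hypothetical availabilities for three components the service depends on.
web, app, db = 0.9995, 0.9990, 0.9999

# With hard dependencies, the service is up only when every component is
# up, so (assuming independent failures) the availabilities multiply:
service = web * app * db
print(service)  # ~0.9984 - lower than the worst single component
```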
The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.
I work with a company that is just begging to dive into the world of IT automation. In actuality they had little choice, as their new software applications have wreaked havoc on the company's network. They are desperate to improve application availability (http://www.stratavia.com) throughout the system, mainly because the software they implemented recently is software that their clients use for their websites, and as those have become extremely slow, when they're even up and running, the time for change has come. I'm part of a team that's been looking into new automation tools and am compiling a report that's due by the end of this week. So far Opalis and Stratavia are looking good, but I've got to dig up more info on both companies.
Posted by: Samantha | 20 November 2007 at 12:00
Interesting. I'm not familiar with either company, or their products, but I'll go look them up and see what they're up to.
Automation is a very hard thing to do right over a broad scope - there are many opportunities to make things worse rather than better. As you probably have gathered, my personal perspective is to approach things from the availability management perspective. Not that this is the only way, or somehow the best way. It does have the advantage of being a perspective that has largely well-proven technologies.
Posted by: Alan R. | 25 November 2007 at 22:00
Is it possible to find the probability of failure of a device at any time t in terms of only the known parameters like MTTR & MTBF, or can you suggest some references? I want to use this for my doctoral research.
Posted by: C.P. Gupta | 05 August 2008 at 01:07
I'm not sure about laptops or PCs (although I hear Apple (Mac + PowerBooks) is very stable), but I still wonder why people talk about availability as if this is a new technology. I know that NEC has a server that is 100% redundant, and only because they have to cover their legal back ends do they say it has 99.999% uptime - oh, and this includes 0% downtime for Windows updates, which as we know should be calculated into the downtime equation. I know some companies prefer spending a small fortune on cluster software, and I guess that's fine if 99.9% uptime is good enough (8 hours of downtime a year!!) and you don't mind paying for all the licenses, etc... I just figure that buying one server that has a money-back guarantee against crashes, one copy of the OS, etc. would seem a better bargain.
Another good company that I have run into, though I've never tried their product personally, is Marathon (marathontechnologies.com). They have unique software that is really cheap and does a fantastic job with redundant solutions.
Please understand, while cluster software has its purposes - IT Directors need to do better research in finding completely redundant systems that are not so darn expensive and that can ensure the internal components - the CPU, RAM, whatever - are 100% redundant.
Posted by: Wes Tafoya | 08 September 2009 at 16:52
I spent the first 20 years of my career working for Bell Labs on exactly those kinds of highly redundant systems. They've been abandoned largely because they are too expensive, and because getting the benefit from them requires special software. Ditto for the Tandem systems - abandoned as too expensive.
Everything fails. EVERYTHING. You just have to wait long enough. Eventually the sun will burn out. The only question is what you're going to do when it fails...
Quite frankly, I think all HA cluster software (as it's been traditionally understood) is doomed. Virtualization makes redundancy and failover simple, and eventually it will make it easy - probably mainly through cloud computing.
Posted by: Alan R. | 08 September 2009 at 21:49