If we let A represent availability, then the simplest formula for availability is:
A = Uptime/(Uptime + Downtime)
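For example, a service that racks up 9,990 hours of uptime and 10 hours of downtime has A = 9990/(9990+10) = 0.999, or 99.9% availability.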
Of course, it's more interesting when you start looking at the things that influence uptime and downtime. The most common measures that can be used in this way are MTBF and MTTR.
- MTBF is Mean Time Between Failures
- MTTR is Mean Time To Repair
A = MTBF / (MTBF+MTTR)
One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
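Here's a minimal sketch of that in Python, using made-up numbers (a hypothetical MTBF of 1,000 hours, not a measurement from any real system) to show how shrinking MTTR pushes availability toward 100%:

```python
# Availability from MTBF and MTTR: A = MTBF / (MTBF + MTTR).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 1000.0  # hypothetical: one failure every 1000 hours

for mttr in (24.0, 1.0, 0.05, 0.0):  # a day, an hour, ~3 minutes, "instant" repair
    print(f"MTTR = {mttr:6.2f} h  ->  A = {availability(mtbf, mttr):.5%}")
```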
That's exactly what HA clustering tries to do. It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can. Depending on the application architecture and how fast a failure can be detected and repaired, a given failure might not be observable at all by a client of the service. If it's not observable by the client, then in some sense it didn't happen at all. This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about it some more later on.
It's important to realize that any given data center or cluster provides many services, and not all of them are related to each other. Failure of one component in the system may not cause failure of the system as a whole. Indeed, good HA design eliminates single points of failure by introducing redundancy. If you're going to try to calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated.
- MTBFx is Mean Time Between Failures for entity x
- MTTRx is Mean Time To Repair for entity x
- Ax is the Availability of entity x
Ax = MTBFx / (MTBFx+MTTRx)
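If you do manage to get per-entity availabilities, the standard way to combine them (assuming failures are independent, which real systems often violate) is to multiply the Ax values for components that are all required, and to multiply the unavailabilities (1 - Ax) for redundant components. Here's a small sketch with a hypothetical per-node availability of 99%:

```python
from math import prod

def series(availabilities):
    # Every component is required: the service is up only when all of them are up.
    return prod(availabilities)

def parallel(availabilities):
    # Redundant components: the service is down only when all of them are down.
    return 1 - prod(1 - a for a in availabilities)

a_node = 0.99  # hypothetical availability of one node

print(f"three required components: {series([a_node] * 3):.6f}")   # ~0.970299
print(f"three redundant copies:    {parallel([a_node] * 3):.6f}")  # ~0.999999
```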
In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very rarely. So why did I spend your time talking about them? That's simple - although you probably won't compute these measures, you can learn some important things from these formulas, and you can see how mistakes in applying them might lead you to some wrong conclusions.
Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.
Let's say we have a service which runs on a single machine, and you put it onto a cluster composed of two computers, each with a certain individual MTBF (Mi), and you can fail over from a failed computer to the other one ("repairing" the service) in a certain repair time (Ri). With two computers, one of them will fail twice as often as a single computer would, so the system MTBF becomes Mi/2. If you compute the availability of the cluster, it then becomes:
A = Mi/2 / (Mi/2+Ri)
Using this (incorrect) analysis for a 1000-node cluster performing the same service, the system MTBF becomes Mi/1000.
A = Mi/1000 / (Mi/1000+Ri)
If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.
A = 0/(0+Ri) = 0/Ri = 0
This makes it appear that adding cluster nodes decreases availability. Is this really true? Of course not! The mistake here is thinking that the service needed all those cluster nodes to make it go. If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct. But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question. Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
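Here's a sketch of both views side by side, with made-up per-node numbers. The "needs-all" column applies the flawed reasoning above (system MTBF of Mi/n, so every node failure is a service failure), while the "redundant" column assumes the service only needs one working node and that node failures are independent:

```python
def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

mi = 10_000.0   # hypothetical per-node MTBF, in hours
ri = 4.0        # hypothetical repair/failover time, in hours
a_node = availability(mi, ri)

for n in (1, 2, 10, 1000):
    needs_all = availability(mi / n, ri)  # the flawed model: system MTBF = Mi/n
    redundant = 1 - (1 - a_node) ** n     # down only if every node is down
    print(f"n = {n:4d}   needs-all A = {needs_all:.4f}   redundant A = {redundant:.12f}")
```

The first column collapses as n grows; the second climbs toward 1, which matches the intuition that spare nodes help (or at worst don't hurt) the service.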
To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails. Here are a few rules of thumb for thinking about availability:
- Complexity is the enemy of reliability (MTBF). This can take many forms:
  - Complex software fails more often than simple software
  - Complex hardware fails more often than simple hardware
  - Software dependencies usually mean that if any component fails, the whole service fails
  - Configuration complexity lowers the chances of the configuration being correct
  - Complexity drastically increases the possibility of human error
  - What is complex software? Software whose model of the universe doesn't match that of the staff who manage it.
- Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR. Replication is another word for redundancy.
- Good failure detection is vital - HA and other autonomic software can only recover from failures it detects. Undetected failures get human-speed MTTR or worse, not autonomic-speed MTTR. They can be worse than human-speed because the humans are surprised that the failure wasn't recovered automatically, and they respond more slowly than normal. In addition, the added complexity of correcting an autonomic service while trying to keep their fingers out of the gears may slow down their thought processes (see the sketch after this list).
- Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability. These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software. More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.
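As an illustration of the failure-detection point above, here's a sketch with invented numbers that compares a failure monitoring catches in about five seconds to one that waits roughly two hours for a human to notice; in both cases the actual failover takes 30 seconds, so the difference is entirely detection time:

```python
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 2000.0         # hypothetical: one failure every 2000 hours
failover = 30 / 3600  # automatic failover takes 30 seconds

# The repair time that matters is detection time plus recovery time.
detected   = availability(mtbf, 5 / 3600 + failover)  # monitoring notices in ~5 seconds
undetected = availability(mtbf, 2.0 + failover)       # a human notices after ~2 hours

print(f"detected failure:   A = {detected:.6%}")
print(f"undetected failure: A = {undetected:.6%}")
```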
The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.