Another thing that I've taken for granted is that people understand how HA systems like Linux-HA or Pacemaker monitor their resources. This is important to me in the Assimilation Monitoring Project (AMP) because we use the same techniques. So this post concentrates on how typical resource agents following the Open Cluster Framework (OCF) API (which we like a lot) work. Since these OCF resource agents were written for high-availability systems, they tend to be both flexible and robust. The OCF API is one of those things that's worked out very well - much better than expected. Both Pacemaker and AMP implement other APIs as well, but for both this is their preferred API.
Years ago, I created and headed a team of people from several companies and several different operating systems to create a standard API for managing high-availability resources. Resource is a vague and general concept - it's the name we used for the things HA systems manage. It might be a web server, it might be a piece of hardware, it might be an IP address, a filesystem mount, or even something as abstract as a set of firewall rules. The key thing is you have to assign a meaning to starting it, to stopping it, and you have to be able to tell if it is started (running) or not (stopped). If you can meet these criteria, then you can make an OCF resource that models it. The part we'll explore in this article is "can we tell if it's running properly or not?" Like the concept of resource, the idea of whether something is properly running is a very flexible idea, and can mean almost whatever you want it to mean. The only real constraints are that if the start operation succeeds, then the monitor should normally succeed right after the start. Conversely, if you have issued a stop, then it should monitor as "not running" until a start is issued.
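To make that contract concrete, here is a minimal sketch in Python. The ToyResource class is hypothetical and exists only for illustration (real resource agents are usually standalone scripts). The two return values are the standard OCF exit codes for "success" and "not running".

```python
# Standard OCF exit codes for the two outcomes discussed above.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


class ToyResource:
    """Hypothetical in-memory resource, used only to illustrate the contract."""

    def __init__(self):
        self._running = False

    def start(self):
        self._running = True
        return OCF_SUCCESS

    def stop(self):
        self._running = False
        return OCF_SUCCESS

    def monitor(self):
        # After a successful start this reports success;
        # after a stop it reports "not running" until the next start.
        return OCF_SUCCESS if self._running else OCF_NOT_RUNNING
```

Anything for which you can write these three operations, with monitor agreeing with the most recent start or stop, fits the model.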
The key to monitoring anything is to monitor the right thing - whatever that might be - and to not make it too trivial. For example, looking to see if a daemon is running is usually too trivial for meaningful monitoring. It's better than nothing, but not a lot... Instead you want to monitor something that is highly correlated to the correct functioning of the resource in question.
For example, in the case of a web server, you want to be able to give it a URL to retrieve, and get the right result. You want to give it a "normal" web page - whose success is well-correlated to the success of the web server and the application behind it. But you don't want to have to continually update the monitoring script every time you update the web server, so typically such a resource agent checks that the request returns the right successful return code (200) and that the page matches a regular expression. For the OCF Apache resource, the default regular expression is very simple - it looks for an "</html>" at the end of the received stream. It's very simple, but if the web server is broken, normally it won't look like that. Of course, you can easily override it with some other value. This kind of flexibility is common in OCF resource agents.
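As a rough sketch of that kind of check (not the Apache resource agent's actual code), the function below takes a status code and a page body that have already been fetched. The function name and the exact default pattern are assumptions made for illustration.

```python
import re

# Assumed default for this sketch: a closing html tag at the end of the page.
DEFAULT_PATTERN = r"</html>\s*$"


def page_looks_healthy(status, body, pattern=DEFAULT_PATTERN):
    """Return True if the response has status 200 and the body matches pattern."""
    if status != 200:
        return False
    return re.search(pattern, body, re.IGNORECASE) is not None


# A complete page passes; a truncated error page does not.
print(page_looks_healthy(200, "<html><body>ok</body></html>\n"))
print(page_looks_healthy(200, "<html><body>Fatal error"))
```

Passing a different pattern argument corresponds to overriding the default regular expression for your own application.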
Similar things apply to other resource agents. For example, the MySQL, DB2 and Oracle resource agents each perform a simple query that is cheap and should always succeed. In each case, the query is against the authorized users, and the result is counted. The count should be >= 1. If it's not, then the check has failed. Although it's simple, it's a real query that exercises most layers of the database system, and whose result is easily verified.
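The same idea can be sketched with Python's built-in sqlite3 module standing in for a real database server. The users table and the query here are hypothetical; the actual agents run their own queries against each database's user catalog.

```python
import sqlite3


def database_is_healthy(conn, query="SELECT COUNT(*) FROM users"):
    """Run a cheap counting query; healthy means it succeeds and counts >= 1."""
    try:
        (count,) = conn.execute(query).fetchone()
    except sqlite3.Error:
        # Any database error counts as a failed check.
        return False
    return count >= 1


# Stand-in database with a single (hypothetical) authorized user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('monitor')")
print(database_is_healthy(conn))
```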
In general, you want monitoring checks that are cheap to run, so that you can run them frequently. However, there are occasionally needs for more in-depth monitoring actions - checks you would like to perform, but which are too expensive to run frequently.
There is a Filesystem resource agent that mounts a filesystem. The normal level check is something simple like checking to see if it's still mounted, and if a df command on it succeeds. This is cheap and can be run frequently. But what if that filesystem has been corrupted - by hardware failures or software bugs? A more extensive check that might be worth running once a month would be to snapshot the filesystem and perform a full fsck on the snapshot. The OCF resource agent API supports these multiple levels of monitoring through the OCF_CHECK_LEVEL parameter. The invoker of a monitoring action is permitted to say how gnarly the test is supposed to be - and if the resource agent supports that level, it will act accordingly. If it doesn't support that level, it will assume the next lowest level that it does support.
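The fallback rule can be sketched as follows. The level numbers 0, 10 and 20 are only illustrative, the function names are hypothetical, and two behaviors are assumptions of this sketch: a missing OCF_CHECK_LEVEL is treated as level 0, and a request below every supported level falls back to the cheapest supported one.

```python
def requested_check_level(environ):
    """Read the requested level from an environment-like mapping (default 0)."""
    return int(environ.get("OCF_CHECK_LEVEL", "0"))


def effective_check_level(requested, supported):
    """Pick the requested level, or the next lowest level the agent supports."""
    candidates = [level for level in supported if level <= requested]
    if candidates:
        return max(candidates)
    # Assumption: if nothing is low enough, use the cheapest supported check.
    return min(supported)


SUPPORTED_LEVELS = [0, 10, 20]  # illustrative levels only
level = effective_check_level(requested_check_level({"OCF_CHECK_LEVEL": "15"}),
                              SUPPORTED_LEVELS)
print(level)  # 15 is not supported, so the agent falls back to 10
```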
So, the OCF resource agent specification is flexible in two dimensions - an agent can take parameters that tell it what the resource is, along with details such as regular expressions, and it can also be told how expensive a check it is expected to perform.
No matter what you want to check, it is likely that you can write an OCF resource agent to check it for you - and you can even select which check to perform at run time through the use of the OCF_CHECK_LEVEL parameter. They can be as lightweight as you want, or as "meaningful" as you want (where you get to define what meaningful means to you).
To top it all off, OCF resource agents also provide metadata describing all the parameters that they take and what they mean - potentially in multiple languages.
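As a rough illustration of how a tool might consume that metadata, the sketch below parses a simplified, made-up XML document. The agent name, parameter names and descriptions are hypothetical, and real agent metadata carries more detail than this.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical metadata for an imaginary agent.
SAMPLE_METADATA = """
<resource-agent name="toy-web">
  <parameters>
    <parameter name="url">
      <shortdesc lang="en">URL to fetch</shortdesc>
      <shortdesc lang="de">Abzurufende URL</shortdesc>
      <content type="string"/>
    </parameter>
    <parameter name="pattern">
      <shortdesc lang="en">Pattern the page must match</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
</resource-agent>
""".strip()


def describe_parameters(xml_text, lang="en"):
    """Map each parameter name to its short description in the given language."""
    root = ET.fromstring(xml_text)
    result = {}
    for param in root.iter("parameter"):
        desc = param.find(f"shortdesc[@lang='{lang}']")
        # None means the agent offers no description in that language.
        result[param.get("name")] = desc.text if desc is not None else None
    return result


print(describe_parameters(SAMPLE_METADATA, lang="de"))
```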
One other note about implementations of the OCF standard - all the ones I know about monitor resources on the machines where the resources are. For example, they execute the monitor action on a machine which is running the database. They then use other techniques to watch this watcher - as is common in reliable systems.
Since AMP implements the OCF resource agent specification (among others), this also means that it will eventually be able to use this API to start and stop resources as well as monitor them. Extensions to the API also support migrating virtual machines - and there are resource agents written for most common types of virtual machines.